Marketing Campaign Analysis¶

Problem Definition¶

The Context:¶

  • Why is this problem important to solve?
    • To better understand the customer base
    • To understand the customer's behavior and characterstics, which has direct impact on organization's sales and revenue numbers
    • To get insights on the individual cusotmer preferences
    • To have better understanding of the data, to design efficient processes
    • To efficiently use markting budget and increase organization's ROI
    • Help this organization make informed decisions

The objective:¶

  • What is the intended goal?
    • Using the given customer dataset, group the customers by employing various unsupervised learning techniques
    • Identify most effective marketing channels
    • To identify new market opportunities

The key questions:¶

  • What are the key questions that need to be answered?
    • Data Structure - Is it continuous or categorical or mixture of both
    • How to handle missing values and outliers
    • How to determins optimal number of clusters
    • Which metric to use to measure the performance of the algorithm
    • What and all plots to use to effectively visualize and communicate

The problem formulation:¶

  • What is it that we are trying to solve using data science?

    • Investigate and analyse hidden patterns in the given dataset
    • Uses the combination of math and computer science to solve business issues and provide actionable options for business leaders
    • Identify new business opportunities for organizations
    • Using the machine learning models and algorithms, automate the complex business processes
    • Help organizations devise targeted marketing campaigns tailored towards individual customers
    • Same targetted individual campaign is employed in US elections
    • AirBnB is the example for Search Function, behind the scenes using Data Science and Analytics

      Ref:

    1. https://www.mygreatlearning.com/blog/how-data-science-solves-real-business-problems/#:~:text=Data%20science%20solves%20real%20business%20problems%20by%20utilising%20data%20to,models%20to%20get%20actionable%20insights.
    2. https://u-next.com/blogs/data-science/importance-of-data-science/#:~:text=Data%20Science%20enables%20companies%20to,%2C%20policy%20work%2C%20and%20more.
    3. https://e2eml.school/five_questions_data_science_answers.html

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruits products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Import the necessary libraries and load the data¶

In [1]:
# Libraries to read and manipulate data
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To compute distances
from scipy.spatial.distance import cdist
# To compute distances
from scipy.spatial.distance import pdist

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize an elbow curve and Silhouette scores
#from yellowbrick.cluster import SilhouetteVisualizer

# To encode the variable
from sklearn.preprocessing import LabelEncoder

# Importing PCA
from sklearn.decomposition import PCA

# Importing TSNE
from sklearn.manifold import TSNE

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To import K-Medoids
from sklearn_extra.cluster import KMedoids

# To import DBSCAN
from sklearn.cluster import DBSCAN

# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture

# To work with datatime
from datetime import datetime

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Drop the ID column which has no null values
In [2]:
# Reading the given dataset
data = pd.read_csv("marketing_campaign.csv")
In [3]:
# First firve rows of the data, just to get a feel of it.
data.head()
Out[3]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [4]:
# Last five rows of the data
data.tail()
Out[4]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 ... 3 4 5 0 0 0 0 0 0 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 ... 2 5 7 0 0 0 1 0 0 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 ... 3 13 6 0 1 0 0 0 0 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 ... 5 10 3 0 0 0 0 0 0 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 ... 1 4 7 0 0 0 0 0 0 1

5 rows × 27 columns

In [5]:
# Shape of the given data
data.shape
Out[5]:
(2240, 27)
In [6]:
# Info of the given dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB

Observations:¶

  1. Both shape and info() are confirming that we have 2240 observations and 27 columns in the given dataset.
  2. All columns have 2240 non-null values, except one column income.
  3. We have three variables/columns of object data type, one column with float and rest are int64 data type.
  4. Dt_Customer column's data type is "Object"!!! May need to update the data type to datetime object.

Let's check for missing values and duplicates.

In [7]:
# Checking for missing values
data.isnull().sum()
Out[7]:
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Response                0
dtype: int64
In [8]:
mv_count = data.Income.isnull().sum() # missing values count
non_mv_count = data.Income.notnull().sum() # record count with data
mv_percent = (mv_count / non_mv_count) * 100 # Percent missing values

print('Number of Missing values in Income column: ', mv_count)
print('Number of non missing values in Income column: ', non_mv_count)
print('Percentage of missing values in Income column: ', round(mv_percent, 2))
Number of Missing values in Income column:  24
Number of non missing values in Income column:  2216
Percentage of missing values in Income column:  1.08
In [9]:
#Checking for duplicates
data[data.duplicated()]
Out[9]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response

0 rows × 27 columns

Observations:¶

  • Except income column, rest of the columns have no missing values.
  • Will try to impute these missing values with either mean or median, depending on their skewness or non-skewness.
  • No duplicates in the given dataset.
In [10]:
# Checking for unique values
data.nunique()
Out[10]:
ID                     2240
Year_Birth               59
Education                 5
Marital_Status            8
Income                 1974
Kidhome                   3
Teenhome                  3
Dt_Customer             663
Recency                 100
MntWines                776
MntFruits               158
MntMeatProducts         558
MntFishProducts         182
MntSweetProducts        177
MntGoldProds            213
NumDealsPurchases        15
NumWebPurchases          15
NumCatalogPurchases      14
NumStorePurchases        14
NumWebVisitsMonth        16
AcceptedCmp3              2
AcceptedCmp4              2
AcceptedCmp5              2
AcceptedCmp1              2
AcceptedCmp2              2
Complain                  2
Response                  2
dtype: int64

Observations:¶

  • Looks like ID column might be index, because number of observations and the number of unique values count is equal. This column doesn't add that much value to my analysis, so would drop this column.
In [11]:
# Dropping the "ID" column
data = data.drop(['ID'], axis = 1) # axis=1 for columns and axis=0 for rows
In [12]:
data.info() # making sure the 'ID' column got dropped.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2240 non-null   int64  
 1   Education            2240 non-null   object 
 2   Marital_Status       2240 non-null   object 
 3   Income               2216 non-null   float64
 4   Kidhome              2240 non-null   int64  
 5   Teenhome             2240 non-null   int64  
 6   Dt_Customer          2240 non-null   object 
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp3         2240 non-null   int64  
 20  AcceptedCmp4         2240 non-null   int64  
 21  AcceptedCmp5         2240 non-null   int64  
 22  AcceptedCmp1         2240 non-null   int64  
 23  AcceptedCmp2         2240 non-null   int64  
 24  Complain             2240 non-null   int64  
 25  Response             2240 non-null   int64  
dtypes: float64(1), int64(22), object(3)
memory usage: 455.1+ KB

Observations and Insights from the Data overview: _¶

  • Made multiple observations above after each significant operation.
  • After dropping ID column, we have 2240 observations and 26 columns.
In [13]:
# Making a copy of the dataframe
data_bkup = data.copy()
# data = data_bkup.copy()

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables
  2. Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights
  3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?

*Summary Statistics¶

In [14]:
#Summary statistics for numerical nariables
data.describe().round(2).T
Out[14]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.81 11.98 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.25 25173.08 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.44 0.54 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.51 0.54 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.11 28.96 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.94 336.60 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.30 39.77 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.95 225.72 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.53 54.63 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.06 41.28 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.02 52.17 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.33 1.93 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.08 2.78 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.66 2.92 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.79 3.25 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.32 2.43 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.07 0.26 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.07 0.26 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.07 0.26 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.06 0.25 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.01 0.11 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.01 0.10 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.15 0.36 0.0 0.00 0.0 0.00 1.0

Observations:¶

  • Count column is telling us that all variables have 2240 observations, except the "Income" which has only 2216 observations. We need to impute the missing values.
  • For most of the variables, mean is > median, so we can expect the right skewed distributions.
  • "Complain" & "Response" have mean very close to median, so these two will have normal distribution.
In [15]:
#Summary statistics for non-numerical nariables
data.describe(exclude = 'number').T
Out[15]:
count unique top freq
Education 2240 5 Graduation 1127
Marital_Status 2240 8 Married 864
Dt_Customer 2240 663 31-08-2012 12

Observations:¶

  • Education column has five unique categories with "Graduation" being top qualification of the customers in the given dataset.
  • Marital_Status column has eight unique categories, with more customers being married.
  • Dt_Customer column has 663 unique observations, with August 31st, 2012 being top date.
In [16]:
# ed_values = data.Education.unique() # unique values in the Education column# 
# print(ed_values)

# Let's take closer look at the categorical variables
cat_var = ['Education', 'Marital_Status']

# Counting each unique value in each column
for column in cat_var:
    print(data[column].value_counts())
    
    print('-' * 50)
Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: Education, dtype: int64
--------------------------------------------------
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64
--------------------------------------------------

Observations:¶

  • Education column has five categories, Graduation, PhD, Master, 2n Cycle and Basic respectively.
  • 2n Cycle category has very less number of observations, so will merge this category into Master category.
  • More customers are married in the given dataset.
  • If we take in-relatioship as big picture then more than 50% of customers are in relationship in the given dataset.
  • May be we can merge the above two columns!!!
  • Similarly for Marital_Status column, will merge Divorced, Widow, Alone, Absurd & YOLO categories into Single category.
In [17]:
# Merging categories in Education column
data = data.replace(['2n Cycle'], 'Master')
In [18]:
# Merging categories in Marital_Status column
data = data.replace(['Divorced', 'Widow', 'Alone', 'Absurd', 'YOLO'], 'Single')
In [19]:
# Let's make sure the replace command did its job.

cat_var = ['Education', 'Marital_Status']

# Counting each unique value in each column
for column in cat_var:
    print(data[column].value_counts())
    
    print('-' * 50)
Graduation    1127
Master         573
PhD            486
Basic           54
Name: Education, dtype: int64
--------------------------------------------------
Married     864
Single      796
Together    580
Name: Marital_Status, dtype: int64
--------------------------------------------------
In [20]:
# Let's explore someother variables like Kidhome, Teenhome, Complain & Response
cat_var = ['Kidhome', 'Teenhome', 'Complain', 'Response']

# Counting each unique value in each column
for column in cat_var:
    print(data[column].value_counts())
    
    print('-' * 50)
0    1293
1     899
2      48
Name: Kidhome, dtype: int64
--------------------------------------------------
0    1158
1    1030
2      52
Name: Teenhome, dtype: int64
--------------------------------------------------
0    2219
1      21
Name: Complain, dtype: int64
--------------------------------------------------
0    1906
1     334
Name: Response, dtype: int64
--------------------------------------------------

Observations:¶

  • Only 48 customers have two kids, 899 customers have one kid and 1293 customers don't have kids at all.
  • 52 cutomers have two teens, 1030 customers have one teen and 1158 customer doesn't have teens, that doesn't mean they don't have kids. Could be younger than teens or older than teens.
  • In last two years only 21 customers complained.
  • Only 334 customers accepted the offer in teh last campaign.

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histogram and box plot for different numerical features and understand how the data looks like.
  • Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot are showing some extreme value on the right side of the distribution of the 'Income' feature. Can we consider them as outliers and remove or should we analyze these extreme values?
  3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? At what percentile the upper whisker lies?
In [21]:
# Plotting HIST and BOX plots for each variable
#hist_cols = data.select_dtypes(include = np.number)

hist_cols = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 
             'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth']

#for col in data.columns:
for col in hist_cols:
    print('Histogram and Boxplot for',col)

    print('Skew :', round(data[col].skew(), 2))
    
    plt.figure(figsize = (15, 4))
    
    plt.subplot(1, 2, 1)
    
 # For histogram   
    data[col].hist()
    
    plt.ylabel('count')  # Add "count" to Y axis
    plt.axvline(data[col].mean(), color = 'r', linestyle = '--')      # Add mean to the histogram
    mean = data[col].mean()
    plt.text(mean, plt.ylim()[1]*0.9, f'mean={mean:.2f}', color = 'r', fontsize=10)
   
    plt.axvline(data[col].median(), color = 'black', linestyle = '-') # Add median to the histogram
    median = data[col].median()
    plt.text(median, plt.ylim()[1]*0.8, f'median={median:.2f}', color = 'black', fontsize=10)
    
# For Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    
    #adding labels
    plt.text(y=0.45, x=data[col].min(), s='min')
    plt.text(y=0.45, x=data[col].quantile(0.25), s='Q1')
    plt.text(y=0.45, x=data[col].median(), s='Q2')
    plt.text(y=0.49, x=data[col].median(), s='median')
    plt.text(y=0.45, x=data[col].quantile(0.75), s='Q3')
    plt.text(y=0.45, x=data[col].max(), s='max')
    
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    upper_whisker = Q3 + 1.5 * IQR
    lower_whisker = Q1 - 1.5 * IQR
    p99_5 = data[col].quantile(0.995)
    
    print(f"Upper Whisker: {upper_whisker} & Lower Whisker: {lower_whisker}")
    #upper_whisker_percentile = data[col][data[col] <= upper_whisker].quantile(0.99)
    #print("Upper Whisker percentile value is: ", upper_whisker_percentile)
    print("99.5 percentile value: ", round(p99_5, 2))
    
    num_outliers = data[data[col] > upper_whisker][col].count()
    row_count = data[col].notnull().sum()
    #print("Row Count", row_count)
    print(f"Number of Outliers > Upper Whisker value are: ", num_outliers)
    print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
    
    num_outliers = data[data[col] > p99_5][col].count()
    #print("Row Count", row_count)
    print(f"Number of Outliers > 99.5 percentile value are: ", num_outliers)
    print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
        
    #plt.text(y=0.45, x=lower_whisker, s='lw')
    plt.text(y=0.45, x=upper_whisker, s='uw')
    
    plt.show()
Histogram and Boxplot for Income
Skew : 6.76
Upper Whisker: 118350.5 & Lower Whisker: -14525.5
99.5 percentile value:  102145.75
Number of Outliers > Upper Whisker value are:  8
Those Outliers percentage is : 0.0036
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntWines
Skew : 1.18
Upper Whisker: 1225.0 & Lower Whisker: -697.0
99.5 percentile value:  1373.15
Number of Outliers > Upper Whisker value are:  35
Those Outliers percentage is : 0.0156
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFruits
Skew : 2.1
Upper Whisker: 81.0 & Lower Whisker: -47.0
99.5 percentile value:  184.8
Number of Outliers > Upper Whisker value are:  227
Those Outliers percentage is : 0.1013
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntMeatProducts
Skew : 2.08
Upper Whisker: 556.0 & Lower Whisker: -308.0
99.5 percentile value:  950.02
Number of Outliers > Upper Whisker value are:  175
Those Outliers percentage is : 0.0781
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFishProducts
Skew : 1.92
Upper Whisker: 120.5 & Lower Whisker: -67.5
99.5 percentile value:  241.61
Number of Outliers > Upper Whisker value are:  223
Those Outliers percentage is : 0.0996
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntSweetProducts
Skew : 2.14
Upper Whisker: 81.0 & Lower Whisker: -47.0
99.5 percentile value:  191.8
Number of Outliers > Upper Whisker value are:  248
Those Outliers percentage is : 0.1107
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntGoldProds
Skew : 1.89
Upper Whisker: 126.5 & Lower Whisker: -61.5
99.5 percentile value:  241.0
Number of Outliers > Upper Whisker value are:  207
Those Outliers percentage is : 0.0924
Number of Outliers > 99.5 percentile value are:  11
Those Outliers percentage is : 0.0049
Histogram and Boxplot for NumDealsPurchases
Skew : 2.42
Upper Whisker: 6.0 & Lower Whisker: -2.0
99.5 percentile value:  12.0
Number of Outliers > Upper Whisker value are:  86
Those Outliers percentage is : 0.0384
Number of Outliers > 99.5 percentile value are:  10
Those Outliers percentage is : 0.0045
Histogram and Boxplot for NumWebPurchases
Skew : 1.38
Upper Whisker: 12.0 & Lower Whisker: -4.0
99.5 percentile value:  11.0
Number of Outliers > Upper Whisker value are:  4
Those Outliers percentage is : 0.0018
Number of Outliers > 99.5 percentile value are:  4
Those Outliers percentage is : 0.0018
Histogram and Boxplot for NumCatalogPurchases
Skew : 1.88
Upper Whisker: 10.0 & Lower Whisker: -6.0
99.5 percentile value:  11.0
Number of Outliers > Upper Whisker value are:  23
Those Outliers percentage is : 0.0103
Number of Outliers > 99.5 percentile value are:  4
Those Outliers percentage is : 0.0018
Histogram and Boxplot for NumWebVisitsMonth
Skew : 0.21
Upper Whisker: 13.0 & Lower Whisker: -3.0
99.5 percentile value:  9.8
Number of Outliers > Upper Whisker value are:  8
Those Outliers percentage is : 0.0036
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054

Observations:¶

  • Income variable is right skewed distribution in the dataset. There are only 12 outliers whose value is greater than the 99.5 percentile value of the column. Since there are only 12 in number and percentage wise only 0.5% (0.0054), so would delete these 12 outlier observations. Looking at the Box plot for the Income variable, the upper wishker is falling at 118350, with mean = 52247 and std (sigma) = 25173. Z is coming to 2.69 and the upper wishker falling at 99.64 percentile, which is NORTH of 2sigma (std).

  • MntWines is right skewed distribution. Similar to Income varaible, this one also has 12 outliers whose value is grreater than the 99.5 percentile value.

  • MntFruits, MntMeatProduct, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealPurchases, NumCatalogPurchases are all right skewed distributions, similar to MntWines.
  • NumWebPurchases meana dn median are so close, almost tending to be normal distribution. Also only four outliers with no significant impact.

  • NumWebVisitsMonth is left skewed distribution with mean < median. Like others only 12 outliers whose value is greater than the 99.5 percentile value.

In [22]:
#data = data_bkup.copy()
In [23]:
Q1 = data["Income"].quantile(0.25)
Q3 = data["Income"].quantile(0.75)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR
lower_whisker = Q1 - 1.5 * IQR
p99_5 = data["Income"].quantile(0.995)
#outliers_index = data[data["MntWines"] > p99_5].index
outliers_index = data[data["Income"] > upper_whisker].index
    
print("column outliers index: ", outliers_index)
print(np.size(outliers_index))
column outliers index:  Int64Index([164, 617, 655, 687, 1300, 1653, 2132, 2233], dtype='int64')
8
In [24]:
# deleting above rows
data.drop(outliers_index, inplace = True)
In [25]:
data.info() #checking 8 rows got deleted
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2232 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2232 non-null   int64  
 1   Education            2232 non-null   object 
 2   Marital_Status       2232 non-null   object 
 3   Income               2208 non-null   float64
 4   Kidhome              2232 non-null   int64  
 5   Teenhome             2232 non-null   int64  
 6   Dt_Customer          2232 non-null   object 
 7   Recency              2232 non-null   int64  
 8   MntWines             2232 non-null   int64  
 9   MntFruits            2232 non-null   int64  
 10  MntMeatProducts      2232 non-null   int64  
 11  MntFishProducts      2232 non-null   int64  
 12  MntSweetProducts     2232 non-null   int64  
 13  MntGoldProds         2232 non-null   int64  
 14  NumDealsPurchases    2232 non-null   int64  
 15  NumWebPurchases      2232 non-null   int64  
 16  NumCatalogPurchases  2232 non-null   int64  
 17  NumStorePurchases    2232 non-null   int64  
 18  NumWebVisitsMonth    2232 non-null   int64  
 19  AcceptedCmp3         2232 non-null   int64  
 20  AcceptedCmp4         2232 non-null   int64  
 21  AcceptedCmp5         2232 non-null   int64  
 22  AcceptedCmp1         2232 non-null   int64  
 23  AcceptedCmp2         2232 non-null   int64  
 24  Complain             2232 non-null   int64  
 25  Response             2232 non-null   int64  
dtypes: float64(1), int64(22), object(3)
memory usage: 470.8+ KB
In [26]:
data1 = data.copy() # taking backup
# data = data1.copy()
In [27]:
# Just checking the plots one more time
hist_cols = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 
             'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth']

#for col in data.columns:
for col in hist_cols:
    print('Histogram and Boxplot for',col)

    print('Skew :', round(data[col].skew(), 2))
    
    plt.figure(figsize = (15, 4))
    
    plt.subplot(1, 2, 1)
    
 # For histogram   
    data[col].hist()
    
    plt.ylabel('count')  # Add "count" to Y axis
    plt.axvline(data[col].mean(), color = 'r', linestyle = '--')      # Add mean to the histogram
    mean = data[col].mean()
    plt.text(mean, plt.ylim()[1]*0.9, f'mean={mean:.2f}', color = 'r', fontsize=10)
   
    plt.axvline(data[col].median(), color = 'black', linestyle = '-') # Add median to the histogram
    median = data[col].median()
    plt.text(median, plt.ylim()[1]*0.8, f'median={median:.2f}', color = 'black', fontsize=10)
    
# For Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    
    #adding labels
    plt.text(y=0.45, x=data[col].min(), s='min')
    plt.text(y=0.45, x=data[col].quantile(0.25), s='Q1')
    plt.text(y=0.45, x=data[col].median(), s='Q2')
    plt.text(y=0.49, x=data[col].median(), s='median')
    plt.text(y=0.45, x=data[col].quantile(0.75), s='Q3')
    plt.text(y=0.45, x=data[col].max(), s='max')
    
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    upper_whisker = Q3 + 1.5 * IQR
    lower_whisker = Q1 - 1.5 * IQR
    p99_5 = data[col].quantile(0.995)
    
    print(f"Upper Whisker: {upper_whisker} & Lower Whisker: {lower_whisker}")
    #upper_whisker_percentile = data[col][data[col] <= upper_whisker].quantile(0.99)
    #print("Upper Whisker percentile value is: ", upper_whisker_percentile)
    print("99.5 percentile value: ", round(p99_5, 2))
    
    num_outliers = data[data[col] > upper_whisker][col].count()
    row_count = data[col].notnull().sum()
    #print("Row Count", row_count)
    print(f"Number of Outliers > Upper Whisker value are: ", num_outliers)
    print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
    
    num_outliers = data[data[col] > p99_5][col].count()
    #print("Row Count", row_count)
    print(f"Number of Outliers > 99.5 percentile value are: ", num_outliers)
    print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
        
    #plt.text(y=0.45, x=lower_whisker, s='lw')
    plt.text(y=0.45, x=upper_whisker, s='uw')
    
    plt.show()
Histogram and Boxplot for Income
Skew : 0.01
Upper Whisker: 117930.375 & Lower Whisker: -14444.625
99.5 percentile value:  95158.57
Number of Outliers > Upper Whisker value are:  0
Those Outliers percentage is : 0.0
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntWines
Skew : 1.17
Upper Whisker: 1226.5 & Lower Whisker: -697.5
99.5 percentile value:  1374.35
Number of Outliers > Upper Whisker value are:  35
Those Outliers percentage is : 0.0157
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFruits
Skew : 2.1
Upper Whisker: 79.5 & Lower Whisker: -44.5
99.5 percentile value:  184.84
Number of Outliers > Upper Whisker value are:  248
Those Outliers percentage is : 0.1111
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntMeatProducts
Skew : 1.89
Upper Whisker: 554.125 & Lower Whisker: -306.875
99.5 percentile value:  935.84
Number of Outliers > Upper Whisker value are:  173
Those Outliers percentage is : 0.0775
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFishProducts
Skew : 1.91
Upper Whisker: 120.5 & Lower Whisker: -67.5
99.5 percentile value:  241.69
Number of Outliers > Upper Whisker value are:  223
Those Outliers percentage is : 0.0999
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntSweetProducts
Skew : 2.13
Upper Whisker: 83.5 & Lower Whisker: -48.5
99.5 percentile value:  191.84
Number of Outliers > Upper Whisker value are:  240
Those Outliers percentage is : 0.1075
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntGoldProds
Skew : 1.88
Upper Whisker: 127.125 & Lower Whisker: -61.875
99.5 percentile value:  241.0
Number of Outliers > Upper Whisker value are:  206
Those Outliers percentage is : 0.0923
Number of Outliers > 99.5 percentile value are:  11
Those Outliers percentage is : 0.0049
Histogram and Boxplot for NumDealsPurchases
Skew : 2.32
Upper Whisker: 6.0 & Lower Whisker: -2.0
99.5 percentile value:  11.84
Number of Outliers > Upper Whisker value are:  84
Those Outliers percentage is : 0.0376
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054
Histogram and Boxplot for NumWebPurchases
Skew : 1.39
Upper Whisker: 12.0 & Lower Whisker: -4.0
99.5 percentile value:  11.0
Number of Outliers > Upper Whisker value are:  4
Those Outliers percentage is : 0.0018
Number of Outliers > 99.5 percentile value are:  4
Those Outliers percentage is : 0.0018
Histogram and Boxplot for NumCatalogPurchases
Skew : 1.37
Upper Whisker: 10.0 & Lower Whisker: -6.0
99.5 percentile value:  11.0
Number of Outliers > Upper Whisker value are:  20
Those Outliers percentage is : 0.009
Number of Outliers > 99.5 percentile value are:  1
Those Outliers percentage is : 0.0004
Histogram and Boxplot for NumWebVisitsMonth
Skew : 0.22
Upper Whisker: 13.0 & Lower Whisker: -3.0
99.5 percentile value:  9.84
Number of Outliers > Upper Whisker value are:  8
Those Outliers percentage is : 0.0036
Number of Outliers > 99.5 percentile value are:  12
Those Outliers percentage is : 0.0054

Observation:¶

  • *Income column has become little near to normal distribution.*
  • When I tried to drop more outliers from other columns, the outliers started reappearing. So I felt I might be losing some info or important information.

*Exploring the categorical variables like Education, Kidhome, Teenhome, Complain & Marital_Status.¶

In [28]:
# Creating function for barplots 

def cat_bar_perc(data, z):
    
    total = len(data[z]) # column length
    
    plt.figure(figsize = (15, 5))
    
    ax = sns.countplot(data[z], palette = 'Paired')
    
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class
        
        x = p.get_x() + p.get_width() / 2 - 0.05                    # plot width
        
        y = p.get_y() + p.get_height()                              # plot height
        
        ax.annotate(percentage, (x, y), size = 12)                  
        
    plt.show()      
In [29]:
z = ['Education', 'Kidhome', 'Teenhome', 'Complain', 'Response', 'Marital_Status']

for col in z:
    cat_bar_perc(data, col)

Observations:¶

  • Education In the given dataset more than 50% customers have undergad degree, 21% has doctorate, 25% customers have Masters and only 2.4% have basic education.
  • Kidhome Almost 58% customers don't have kids, 40% have one kid and only 2% have two kids at home.
  • Teenhome Again 51% customers don't have teens, 46% have teens and only 2.3% have teen at home.
  • Complain Less than 1% customer ever logged a complaint, so even it we delete this variable, it may not have any significant impact on our analysis.
  • Response - Only 15% accepted the offers in the last campaign.
  • Marital_Status Almost 39% customers are married, 26% living together and 36% are single.

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.
In [30]:
plt.figure(figsize = (15, 15))

sns.heatmap(data.corr(), annot = True)

plt.show()

Observations:¶

  • Looking at the above I feel like I might have to run thie above one more time, after feature engineering and imputing the missing values.
  • Variables NumCatlogPurchasses and MntMeatProducts are highly correlated.
  • Next high correlation is between MntWines and NumStorePurchases, NumCatlogPurchases.

*relationship of numerical variables with categorical variables¶

Numerical Variables: NumStorePurahases, MntWines, NumCatlogPurchases & MntMeatProducts etc.,
Categorical Variables: Education, Marital_Status, Kidhome & Teenhome etc.,
In [31]:
# Categorical varibales Vs. **NumStorePurchases**
cat_col = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']

for col in cat_col:
    plt.figure(figsize = (10, 10))  
    #sns.boxplot(data['Education'], data['NumStorePurchases']) # x-axis: categorical variable, y-axis: numerical variable
    #sns.boxplot(x = data[col], y = data['NumStorePurchases'])
    
    sns.barplot(x = data[col], y = data['NumStorePurchases'], data= data)
    plt.ylabel('NumStorePurchases')
    
    plt.show()

Observations:¶

  • Education Vs. NumStorePurchases - customers with doctorate degree are doing more store purchases when compared to Masters or undergrad degree customers.
  • Marital_Status Looks like marital status doesn't have any impact on the number of store purchases.
  • Kidhome Customers with two kids have less NumSorePurchases.
  • Teenhome On contrast to above customers with two teens slightly more store purchages than other categories.
In [32]:
# print(data['Income'].dtype)
# data['Income'] = data['Income'].astype(float)
# print(data['Income'].dtype)
In [33]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2232 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2232 non-null   int64  
 1   Education            2232 non-null   object 
 2   Marital_Status       2232 non-null   object 
 3   Income               2208 non-null   float64
 4   Kidhome              2232 non-null   int64  
 5   Teenhome             2232 non-null   int64  
 6   Dt_Customer          2232 non-null   object 
 7   Recency              2232 non-null   int64  
 8   MntWines             2232 non-null   int64  
 9   MntFruits            2232 non-null   int64  
 10  MntMeatProducts      2232 non-null   int64  
 11  MntFishProducts      2232 non-null   int64  
 12  MntSweetProducts     2232 non-null   int64  
 13  MntGoldProds         2232 non-null   int64  
 14  NumDealsPurchases    2232 non-null   int64  
 15  NumWebPurchases      2232 non-null   int64  
 16  NumCatalogPurchases  2232 non-null   int64  
 17  NumStorePurchases    2232 non-null   int64  
 18  NumWebVisitsMonth    2232 non-null   int64  
 19  AcceptedCmp3         2232 non-null   int64  
 20  AcceptedCmp4         2232 non-null   int64  
 21  AcceptedCmp5         2232 non-null   int64  
 22  AcceptedCmp1         2232 non-null   int64  
 23  AcceptedCmp2         2232 non-null   int64  
 24  Complain             2232 non-null   int64  
 25  Response             2232 non-null   int64  
dtypes: float64(1), int64(22), object(3)
memory usage: 470.8+ KB
In [34]:
# Categorical variables Vs. **Income**
cat_col = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']

for col in cat_col:
    plt.figure(figsize = (10, 10))  
    #sns.boxplot(data['Education'], data['NumStorePurchases']) # x-axis: categorical variable, y-axis: numerical variable
    #sns.boxplot(x = data[col], y = data['Income'])
    
    sns.barplot(x = data[col], y = data['Income'], data= data)
    plt.ylabel('Income')
    #plt.xlabel(data[col])

    plt.show()

Observations:¶

  • Customers with Basic education have less income compared to the other categories in the Education.
  • Customers with two kids have lower income, where as with two teens have higher income.

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Imputing missing values
In [35]:
# show all 24 rows with NaN in Income column
data[data.isnull()['Income'] == True]
Out[35]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
10 1983 Graduation Married NaN 1 0 15-11-2013 11 5 5 ... 0 2 7 0 0 0 0 0 0 0
27 1986 Graduation Single NaN 1 0 20-02-2013 19 5 1 ... 0 0 1 0 0 0 0 0 0 0
43 1959 PhD Single NaN 0 0 05-11-2013 80 81 11 ... 3 4 2 0 0 0 0 0 0 0
48 1951 Graduation Single NaN 2 1 01-01-2014 96 48 5 ... 1 4 6 0 0 0 0 0 0 0
58 1982 Graduation Single NaN 1 0 17-06-2013 57 11 3 ... 0 3 6 0 0 0 0 0 0 0
71 1973 Master Married NaN 1 0 14-09-2012 25 25 3 ... 0 3 8 0 0 0 0 0 0 0
90 1957 PhD Married NaN 2 1 19-11-2012 4 230 42 ... 2 8 9 0 0 0 0 0 0 0
91 1957 Graduation Single NaN 1 1 27-05-2014 45 7 0 ... 0 2 7 0 0 0 0 0 0 0
92 1973 Master Together NaN 0 0 23-11-2013 87 445 37 ... 4 8 1 0 0 0 0 0 0 0
128 1961 PhD Married NaN 0 1 11-07-2013 23 352 0 ... 1 7 6 0 0 0 0 0 0 0
133 1963 Graduation Married NaN 0 1 11-08-2013 96 231 65 ... 5 7 4 0 0 0 0 0 0 0
312 1989 Graduation Married NaN 0 0 03-06-2013 69 861 138 ... 5 12 3 0 1 0 1 0 0 0
319 1970 Graduation Single NaN 1 2 23-08-2013 67 738 20 ... 3 10 7 0 1 0 1 0 0 0
1379 1970 Master Together NaN 0 1 01-04-2013 39 187 5 ... 2 6 5 0 0 0 0 0 0 0
1382 1958 Graduation Together NaN 1 1 03-09-2012 87 19 4 ... 0 3 5 0 0 0 0 0 0 0
1383 1964 Master Single NaN 1 1 12-01-2014 49 5 1 ... 0 2 7 0 0 0 0 0 0 0
1386 1972 PhD Together NaN 1 0 02-03-2014 17 25 1 ... 0 3 7 0 0 0 0 0 0 0
2059 1969 Master Together NaN 1 1 18-05-2013 52 375 42 ... 10 4 3 0 0 0 0 0 0 0
2061 1981 PhD Single NaN 1 0 31-05-2013 82 23 0 ... 0 3 6 0 0 0 0 0 0 0
2078 1971 Graduation Married NaN 1 1 03-03-2013 82 71 1 ... 1 3 8 0 0 0 0 0 0 0
2079 1954 Master Together NaN 0 1 23-06-2013 83 161 0 ... 1 4 6 0 0 0 0 0 0 0
2081 1955 Graduation Single NaN 0 1 18-10-2013 95 264 0 ... 1 5 7 0 0 0 0 0 0 0
2084 1943 Master Single NaN 0 0 30-10-2013 75 532 126 ... 5 11 1 0 0 1 0 0 0 1
2228 1978 Master Together NaN 0 0 12-08-2012 53 32 2 ... 0 1 0 0 1 0 0 0 0 0

24 rows × 26 columns

In [36]:
#data = data1.copy()
In [37]:
income_mean = data['Income'].mean() #calculate the mean of the column Income
print('Income Column mean: ',income_mean)
print(type(income_mean))
Income Column mean:  51633.63813405797
<class 'float'>
In [38]:
#data['Income'] = data['Income'].replace(np.NaN, 'income_mean') # replace the NaN with calculated mean value

#print(data['Income'].dtype)

Observation:

  • When using .replace the Income column's data type is being changed to Object type. But need Income in numeric format for down the line processing.
In [39]:
# will try fillna than replace.
print(data['Income'].dtype)
data['Income'] = data['Income'].fillna(income_mean)
print(data['Income'].dtype)
float64
float64
In [40]:
data[data.isnull()['Income'] == True] # checking for missing values
Out[40]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response

0 rows × 26 columns

In [41]:
data.isnull().sum() # Making sure no missing values in the entire dataset.
Out[41]:
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
dtype: int64

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out amount spent per purchase?
In [42]:
# Engineer "age" variable.
curr_year = 2016
#curr_year

data['age'] = curr_year - data['Year_Birth']
data['age'].sort_values()
Out[42]:
1170     20
46       20
696      21
747      21
1850     21
       ... 
424      75
1950     76
192     116
339     117
239     123
Name: age, Length: 2232, dtype: int64

Observations:¶

  • According to Google as well as ChatGPT, oldest person ever lived on this earth is 122, a French woman Jeanne Calment. So will go with the cut-off of 122 and drop the anomalies.
In [43]:
# Rows of column "age" > 122
rowsgt122 = data[data['age'] > 122].index
print(rowsgt122)
Int64Index([239], dtype='int64')
In [44]:
# drop the rows with age > 122
data.drop(rowsgt122, inplace = True)
In [45]:
data['age'].sort_values() # Checking to make sure the identified rows got dropped.
Out[45]:
46       20
1170     20
747      21
1850     21
2213     21
       ... 
2084     73
424      75
1950     76
192     116
339     117
Name: age, Length: 2231, dtype: int64

Observations:¶

  • One row with index 239, whose age > 122 got droped.
  • Now we are left with 2231 rows.
In [46]:
# Engineer the "total_kids" variable
data['total_kids'] = data['Kidhome'] + data['Teenhome']
In [47]:
#How many members each family has? Need some feature engineering.
# Marital_Status + total_kids = family_size

# Merging categories in Marital_Status column
data = data.replace(['Together', 'Married'], 'Relationship')
In [48]:
#Engineering new variable "Rel_Status" with numbers for accumulating purposes
data['Rel_Status'] = data['Marital_Status'].replace({'Single' : 1, 'Relationship' : 2}) 
In [49]:
#Calculate Family_Size
data['Family_Size'] = data['Rel_Status'] + data['total_kids']
In [50]:
# Total_Amount_Spent by the customers on various products
data['Total_Amount_Spent'] = data['MntFishProducts'] + data['MntMeatProducts'] + data['MntFruits'] + data['MntSweetProducts'] + data['MntWines'] + data['MntGoldProds']
In [51]:
# Total_Amount_Spent by the customers on various products
data['Total_Purchases'] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases'] 
In [52]:
# How long the customer has been with the company

# Converting Dt_Customer datatype 
data.Dt_Customer = pd.to_datetime(data.Dt_Customer)
#data.Dt_Customer[1]

curr_date = pd.to_datetime('01-01-2016').normalize() # current -date
curr_date 

data['Enrolled_In_Days'] = (curr_date - data.Dt_Customer) / np.timedelta64(1, 'D') #number of days enrolled
In [53]:
# How many offers the customers have accepted?
data['Total_Offers_Accepted'] = data['AcceptedCmp1'] + data['AcceptedCmp2'] + data['AcceptedCmp3'] + data['AcceptedCmp4'] + data['AcceptedCmp5'] + data['Response']
In [54]:
# Total Amount spent per purchase
#data['Total_Purchases'] = data['NumDealsPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases'] + data['NumWebPurchases']

data['Amount_Per_Purchase'] = round((data['Total_Amount_Spent'] / data['Total_Purchases']), 2)
In [55]:
# So describing the data, one more time
data.describe().T
Out[55]:
count mean std min 25% 50% 75% max
Year_Birth 2231.0 1.968827e+03 11.885098 1899.00 1959.00 1970.00 1977.000 1996.0
Income 2231.0 5.162981e+04 20605.486323 1730.00 35428.50 51563.00 68118.000 113734.0
Kidhome 2231.0 4.446437e-01 0.538605 0.00 0.00 0.00 1.000 2.0
Teenhome 2231.0 5.069476e-01 0.544702 0.00 0.00 0.00 1.000 2.0
Recency 2231.0 4.912281e+01 28.954103 0.00 24.00 49.00 74.000 99.0
MntWines 2231.0 3.050632e+02 336.801738 0.00 24.00 176.00 505.000 1493.0
MntFruits 2231.0 2.639220e+01 39.826540 0.00 2.00 8.00 33.000 199.0
MntMeatProducts 2231.0 1.653913e+02 219.424360 0.00 16.00 67.00 231.500 1725.0
MntFishProducts 2231.0 3.765845e+01 54.697914 0.00 3.00 12.00 50.000 259.0
MntSweetProducts 2231.0 2.716764e+01 41.330676 0.00 1.00 8.00 34.000 263.0
MntGoldProds 2231.0 4.418512e+01 52.208570 0.00 9.00 25.00 56.500 362.0
NumDealsPurchases 2231.0 2.318691e+00 1.894814 0.00 1.00 2.00 3.000 15.0
NumWebPurchases 2231.0 4.099059e+00 2.774573 0.00 2.00 4.00 6.000 27.0
NumCatalogPurchases 2231.0 2.637382e+00 2.796029 0.00 0.00 2.00 4.000 28.0
NumStorePurchases 2231.0 5.809951e+00 3.241921 0.00 3.00 5.00 8.000 13.0
NumWebVisitsMonth 2231.0 5.332138e+00 2.415652 0.00 3.00 6.00 7.000 20.0
AcceptedCmp3 2231.0 7.306141e-02 0.260296 0.00 0.00 0.00 0.000 1.0
AcceptedCmp4 2231.0 7.485433e-02 0.263215 0.00 0.00 0.00 0.000 1.0
AcceptedCmp5 2231.0 7.306141e-02 0.260296 0.00 0.00 0.00 0.000 1.0
AcceptedCmp1 2231.0 6.454505e-02 0.245776 0.00 0.00 0.00 0.000 1.0
AcceptedCmp2 2231.0 1.299866e-02 0.113294 0.00 0.00 0.00 0.000 1.0
Complain 2231.0 9.412819e-03 0.096584 0.00 0.00 0.00 0.000 1.0
Response 2231.0 1.497087e-01 0.356866 0.00 0.00 0.00 0.000 1.0
age 2231.0 4.717346e+01 11.885098 20.00 39.00 46.00 57.000 117.0
total_kids 2231.0 9.515912e-01 0.751910 0.00 0.00 1.00 1.000 3.0
Rel_Status 2231.0 1.644106e+00 0.478891 1.00 1.00 2.00 2.000 2.0
Family_Size 2231.0 2.595697e+00 0.907201 1.00 2.00 3.00 3.000 5.0
Total_Amount_Spent 2231.0 6.058579e+02 601.452000 5.00 69.00 397.00 1044.500 2525.0
Total_Purchases 2231.0 1.486508e+01 7.621082 0.00 8.00 15.00 21.000 43.0
Enrolled_In_Days 2231.0 9.030309e+02 232.194880 391.00 732.00 904.00 1077.000 1454.0
Total_Offers_Accepted 2231.0 4.482295e-01 0.889846 0.00 0.00 0.00 1.000 5.0
Amount_Per_Purchase 2231.0 inf NaN 0.53 9.71 23.38 45.455 inf
In [56]:
col2_dtype = data['Amount_Per_Purchase'].dtype

print(f"The datatype of 'Amount_Per_Purchase' is {col2_dtype}")

inf_count = np.isinf(data['Amount_Per_Purchase']).sum()

print(f"There are {inf_count} infinity values in 'Amount_Per_Purchase'")
The datatype of 'Amount_Per_Purchase' is float64
There are 2 infinity values in 'Amount_Per_Purchase'
In [57]:
# find column(s) with infinity values
inf_cols = data.isin([np.inf, -np.inf]).any()

# get column name(s) with infinity values
inf_col_names = inf_cols[inf_cols == True].index.tolist()

print(f"The following column(s) contain infinity values: {inf_col_names}")
The following column(s) contain infinity values: ['Amount_Per_Purchase']
In [58]:
# get rows with infinity values
inf_rows = data.isin([np.inf, -np.inf]).any(axis=1)

inf_df = data[inf_rows]

print(f"The following rows contain infinity values: \n{inf_df}")
The following rows contain infinity values: 
      Year_Birth   Education Marital_Status  Income  Kidhome  Teenhome  \
981         1965  Graduation         Single  4861.0        0         0   
1524        1973  Graduation         Single  3502.0        1         0   

     Dt_Customer  Recency  MntWines  MntFruits  ...  Response  age  \
981   2014-06-22       20         2          1  ...         0   51   
1524  2013-04-13       56         2          1  ...         0   43   

      total_kids  Rel_Status  Family_Size  Total_Amount_Spent  \
981            0           1            1                   6   
1524           1           1            2                   5   

      Total_Purchases  Enrolled_In_Days  Total_Offers_Accepted  \
981                 0             558.0                      0   
1524                0             993.0                      0   

      Amount_Per_Purchase  
981                   inf  
1524                  inf  

[2 rows x 35 columns]
In [59]:
rows_index = data.index[data.isin([np.inf, -np.inf]).any(1)]
print(rows_index)
Int64Index([981, 1524], dtype='int64')
In [60]:
# Dropping the above rows
data.drop(rows_index, inplace = True)

#print("Dataframe after deleting rows with infinity values:")
#print(data)
In [61]:
# MAking sure no more infinity values
# find column(s) with infinity values
inf_cols = data.isin([np.inf, -np.inf]).any()

# get column name(s) with infinity values
inf_col_names = inf_cols[inf_cols == True].index.tolist()

print(f"The following column(s) contain infinity values: {inf_col_names}")
The following column(s) contain infinity values: []

**No column names displayed. So infinity values gone.

Observation:¶

  • Down the line Summary Statictics gave "inf" issues, so had to come back and handle them here.
  • Since only two rows, negligible impact, so dropped them.

*Atlast Engineering & preprocessing done, yahooo!!!¶

Important Insights from EDA and Data Preprocessing¶

What are the the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation plot after we've removed the irrelevant variables
  • Scale the Data

Dropping the variables for whom we engineered the new variables.¶

  • Year_Birth - Captured that information in the "Age" variable.
  • Removing categorial variables like Education & Marital Status - Since the corresponding informationn got transformed into numerical form in other engineered variables.
  • Also all the variables for the campain information captured in Total_Offers_Accepted variable.
  • Dt_Customer - Captured in the "Enrolled in Days" variable.
  • Complain - As we have seen earlier so small number of customer complained, so that small number is not going to impact our analysis, so dropping.
  • Total_Kids & Family Size - are being dropped since keeping Kidhome & Teenhome to see impact. .
In [62]:
pre_processed_data = data.copy() # just trying to save the effort incase "data" df got corrupted
In [63]:
# Dropping all the redundant information columns
processed_data = data.drop(["Year_Birth", "Education", "Marital_Status", "Dt_Customer",
                             "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5",
                             "Response", "Complain", "Family_Size", "total_kids"
                            ],
                            axis=1, inplace = True
                           )
In [64]:
#processed_data = data.drop(["Income", "age", "Family_Size"], axis=1, inplace = True)
In [65]:
data.describe().T
Out[65]:
count mean std min 25% 50% 75% max
Income 2229.0 51672.380132 20565.602889 1730.00 35523.00 51569.00 68118.00 113734.0
Kidhome 2229.0 0.444594 0.538636 0.00 0.00 0.00 1.00 2.0
Teenhome 2229.0 0.507402 0.544735 0.00 0.00 0.00 1.00 2.0
Recency 2229.0 49.132795 28.960156 0.00 24.00 49.00 74.00 99.0
MntWines 2229.0 305.335128 336.830396 0.00 24.00 177.00 505.00 1493.0
MntFruits 2229.0 26.414984 39.837141 0.00 2.00 8.00 33.00 199.0
MntMeatProducts 2229.0 165.538807 219.467513 0.00 16.00 68.00 232.00 1725.0
MntFishProducts 2229.0 37.691790 54.711120 0.00 3.00 12.00 50.00 259.0
MntSweetProducts 2229.0 27.192014 41.341203 0.00 1.00 8.00 34.00 263.0
MntGoldProds 2229.0 44.223867 52.215955 0.00 9.00 25.00 57.00 362.0
NumDealsPurchases 2229.0 2.320772 1.894389 0.00 1.00 2.00 3.00 15.0
NumWebPurchases 2229.0 4.102737 2.773097 0.00 2.00 4.00 6.00 27.0
NumCatalogPurchases 2229.0 2.639749 2.796166 0.00 0.00 2.00 4.00 28.0
NumStorePurchases 2229.0 5.815164 3.238697 0.00 3.00 5.00 8.00 13.0
NumWebVisitsMonth 2229.0 5.324361 2.402730 0.00 3.00 6.00 7.00 20.0
age 2229.0 47.173620 11.889826 20.00 39.00 46.00 57.00 117.0
Rel_Status 2229.0 1.644684 0.478716 1.00 1.00 2.00 2.00 2.0
Total_Amount_Spent 2229.0 606.396590 601.452740 8.00 69.00 397.00 1045.00 2525.0
Total_Purchases 2229.0 14.878421 7.611471 1.00 8.00 15.00 21.00 43.0
Enrolled_In_Days 2229.0 903.145357 232.176186 391.00 732.00 904.00 1077.00 1454.0
Total_Offers_Accepted 2229.0 0.448632 0.890144 0.00 0.00 0.00 1.00 5.0
Amount_Per_Purchase 2229.0 33.304800 45.063796 0.53 9.71 23.35 45.35 1679.0
In [66]:
data_model = data.copy() # just trying to save the effort incase "data" df got corrupted
In [67]:
data_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2229 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Income                 2229 non-null   float64
 1   Kidhome                2229 non-null   int64  
 2   Teenhome               2229 non-null   int64  
 3   Recency                2229 non-null   int64  
 4   MntWines               2229 non-null   int64  
 5   MntFruits              2229 non-null   int64  
 6   MntMeatProducts        2229 non-null   int64  
 7   MntFishProducts        2229 non-null   int64  
 8   MntSweetProducts       2229 non-null   int64  
 9   MntGoldProds           2229 non-null   int64  
 10  NumDealsPurchases      2229 non-null   int64  
 11  NumWebPurchases        2229 non-null   int64  
 12  NumCatalogPurchases    2229 non-null   int64  
 13  NumStorePurchases      2229 non-null   int64  
 14  NumWebVisitsMonth      2229 non-null   int64  
 15  age                    2229 non-null   int64  
 16  Rel_Status             2229 non-null   int64  
 17  Total_Amount_Spent     2229 non-null   int64  
 18  Total_Purchases        2229 non-null   int64  
 19  Enrolled_In_Days       2229 non-null   float64
 20  Total_Offers_Accepted  2229 non-null   int64  
 21  Amount_Per_Purchase    2229 non-null   float64
dtypes: float64(3), int64(19)
memory usage: 400.5 KB
In [68]:
#Just curious to see how all these variables are correlated to each other
plt.figure(figsize = (15, 15))

sns.heatmap(data.corr(), annot = True)

plt.show()

Observations:¶

  • Amount spent on wines is showing strong correlation to Store and catalog purchases.
  • Amount spent on meat purchases has stronf correlation with catalog purchases.
  • Webvisits has negative correlation with rest of the variables, except NumDealsPurchases.
  • Total offers accepted has medium correlation with amount spent on wines.
  • Amount per purchase has srong correlation amount spent on meat products and total amount spent.

*Scaling the Data¶

In [69]:
scaler = StandardScaler()

data_scaled = pd.DataFrame(scaler.fit_transform(data), columns = data.columns) # Can use data_model or data 
                                                                               # interchangebly. Since both of same data.

data_scaled.head()
Out[69]:
Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth age Rel_Status Total_Amount_Spent Total_Purchases Enrolled_In_Days Total_Offers_Accepted Amount_Per_Purchase
0 0.314461 -0.825592 -0.931676 0.306255 0.978946 1.546266 1.733954 2.455412 1.471211 0.838555 ... 2.632856 -0.560587 0.697546 0.994887 -1.346994 1.680648 1.330078 1.976764 0.619554 0.696396
1 -0.259150 1.031365 0.904492 -0.384504 -0.874034 -0.638115 -0.727099 -0.652514 -0.633699 -0.732199 ... -0.586559 -1.178258 -0.135027 1.247260 -1.346994 -0.963545 -1.166714 -1.667838 -0.504112 -0.639344
2 0.969828 -0.825592 -0.931676 -0.798959 0.358317 0.567061 -0.175641 1.340215 -0.149812 -0.042599 ... -0.228846 1.292426 -0.551314 0.321892 0.742394 0.282053 0.804437 -0.172948 -0.504112 0.080908
3 -1.217178 1.031365 -0.931676 -0.798959 -0.874034 -0.562792 -0.663294 -0.506259 -0.585311 -0.751354 ... -0.944272 -0.560587 0.281260 -1.276472 0.742394 -0.920306 -0.903894 -1.926320 -0.504112 -0.592289
4 0.321999 1.031365 -0.931676 1.549621 -0.392972 0.416414 -0.216658 0.151890 -0.004646 -0.559799 ... 0.128866 0.057084 -0.135027 -1.024098 0.742394 -0.306654 0.541617 -0.823462 -0.504112 -0.246257

5 rows × 22 columns

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [70]:
tsne = TSNE(n_components = 2, random_state = 1) ## Apply the t-SNE algorithm with random_state = 1

data_tsne = tsne.fit_transform(data_scaled) # # Fit and transform T-SNE function on the scaled data

data_tsne.shape
Out[70]:
(2229, 2)
In [71]:
data_tsne = pd.DataFrame(data = data_tsne, columns = ['Component 1', 'Component 2'])

data_tsne.head()
Out[71]:
Component 1 Component 2
0 -17.536951 -43.880280
1 35.045784 -29.042521
2 -34.295082 -10.124798
3 44.926903 48.946674
4 20.414270 42.588531
In [72]:
# Plot a scatter plot to ee the groupings
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1])

plt.show()

Observation and Insights:

  • First two times no clear demarkation of groups.
  • Now redoing the whole notebook, and including some Income, Kidhome and Teenhome and Age etc columns. Now I see clear groups, atleast seven.
In [73]:
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1], hue = data.Income)

plt.show()

Trying to visualize data with different perplexity values

In [74]:
for i in range(10, 50, 5):  # starting with 10 and up to 50 in increments of 5
    tsne = TSNE(n_components = 2, random_state = 1, perplexity = i) # Applying tSNE with perpelxity range 10 thru 50
    
    data_tsne = tsne.fit_transform(data_scaled) # Fit and transform T-SNE function on the scaled data
    
    data_tsne = pd.DataFrame(data_tsne) # Converting embeddings to dataframe
    
    data_tsne.columns = ['X1', 'X2']
    
    plt.figure(figsize = (10,10))
    
    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
    
    plt.title("perplexity = {}".format(i))

Observation:¶

  • At perplexity=15, I see sevn groups (1 large, 4 medium and 2 small groups).
  • At Perplexity=20, I see six groups.
  • At perplexity=25, I see seven groups, combination of big, medium, small and tiny groups.
  • At perplexity=30, also I see seven groups.
  • At perplexity =35 , 40 & 45, the seven group trend continued.

Even though I see seven, but I am hoping they will form three or four resonable size clusters!!! Let's see!!! what PCA tells us.

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?
  • One of the benefits of the PCA is data decorrelation along with other benefits such as dimention reduction, factor analysis, data visualization, data anonymization, data compression to name a few. Also it can be used as data pre-processing tool. We need non correlated variables for better analysis, I feel we have to apply PCA before applying clustering algorithms.
In [75]:
# Figuring out the number of principle components to generate
n = data_scaled.shape[1]
print('Number of Principal Components: ', n)
Number of Principal Components:  22
In [76]:
# Principal Components for the data
# Applying PCA algorith with random_state=1
pca = PCA(n_components = n, random_state = 1) 
data_pca = pd.DataFrame(pca.fit_transform(data_scaled)) # Fit and transform the pca function on scaled data

# The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_
exp_var.cumsum()
Out[76]:
array([0.38335424, 0.48732342, 0.55357456, 0.60720205, 0.65353071,
       0.69897098, 0.73946305, 0.77362121, 0.80576543, 0.83576611,
       0.86108828, 0.88568982, 0.90569495, 0.92453842, 0.94219911,
       0.95873953, 0.97304848, 0.98359186, 0.9934839 , 1.        ,
       1.        , 1.        ])
In [79]:
# Visualizing the explained variance by individual components
plt.figure(figsize = (10, 10))

plt.plot(range(1, 23), exp_var.cumsum(), marker = 'o', linestyle = '--')

plt.title("Explained Variance by Components")
plt.xlabel("Number of Components")
plt.ylabel('Cumulative Explained Variance')

plt.show()

Observation:¶

  • Around 18th or 19th component the curve got flattened. So need to visualize the explained variance by individual components. Also need to figure out the least number of components that can explain more than 90% variance.
In [80]:
# Finding the least number of components that can explain more than 90% variance.
sum = 0

for ix, i in enumerate(exp_var): #its the same or similar code as the elective project.
    sum = sum + i
    if(sum > 0.90):
        print("Numbers of PCs that explain at least 90% of varince: ", ix + 1)
        break
Numbers of PCs that explain at least 90% of varince:  13

Observation:¶

  • Variance Explainability Atleast 80% - 9 PCs
          85% - 11 PCs
          90% - 13 PCs
          97% - 17 PCs 
  • If we select 13PCs then slightly more than 50% of the components may explain more than 85% of the variance.
In [82]:
# Visualizing the explained variance by individual components
plt.figure(figsize = (10, 10))


# Function Definition to add vlue labels to the individual BARs in the plot
def addvaluelabels(x,y):
    for i in range(len(x)):
        plt.text(i+1, y[i]/2, y[i], ha = 'center') #trying to align to center of the bar

plt.bar(range(1, 23), exp_var, alpha = 0.5, align = 'center', label = 'Individual Explained Variance')
plt.plot(range(1, 23), exp_var.cumsum(), marker = 'o', linestyle = '--', label='Cumulative Explained Variance')

#Add value labels
x = range(1,23)
y = np.around(exp_var, 2)
addvaluelabels(x, y)

plt.title("Explained Variance by Components")
plt.xlabel("Number of Components")
plt.ylabel('Cumulative Explained Variance')

plt.legend(loc = 'best')
plt.tight_layout()

plt.show()

Observation and Insights: No visibile break on the plot to eyeball the number of components. As mentioned above will try with 13 PCs.

In [83]:
# Creating a new dataframe with first 13 principal components as columns and original features as indices
cols = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'PC11', 'PC12', 'PC13']

df_pc1 = pd.DataFrame(np.round(pca.components_.T[:, 0:13], 2), index = data_scaled.columns, columns = cols)
df_pc1

def color_high(val):
    if val <= -0.30:
        return 'background: pink'
    
    elif val >= 0.30:
        return 'background: skyblue'   
    
df_pc1.style.applymap(color_high)
Out[83]:
  PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13
Income 0.300000 0.010000 -0.180000 0.080000 -0.020000 0.020000 -0.110000 0.030000 -0.070000 -0.040000 0.030000 -0.140000 0.060000
Kidhome -0.220000 -0.010000 0.240000 -0.050000 -0.050000 0.020000 -0.300000 0.480000 0.040000 -0.240000 -0.110000 0.080000 0.070000
Teenhome -0.030000 -0.440000 -0.390000 0.030000 0.030000 -0.020000 0.080000 0.190000 0.010000 0.130000 0.170000 -0.630000 -0.050000
Recency 0.000000 -0.010000 -0.050000 -0.290000 0.830000 0.250000 -0.330000 -0.090000 0.120000 -0.080000 0.060000 -0.040000 -0.060000
MntWines 0.280000 -0.120000 0.050000 0.310000 0.080000 0.110000 -0.100000 -0.090000 -0.110000 0.010000 0.020000 0.030000 -0.120000
MntFruits 0.230000 0.130000 0.060000 -0.300000 -0.100000 -0.100000 0.090000 0.160000 0.050000 -0.160000 0.210000 -0.110000 -0.720000
MntMeatProducts 0.280000 0.170000 0.080000 0.040000 0.070000 0.070000 -0.010000 0.250000 -0.170000 0.040000 -0.080000 0.120000 0.070000
MntFishProducts 0.240000 0.140000 0.060000 -0.300000 -0.070000 -0.080000 0.130000 0.170000 0.090000 -0.170000 -0.030000 0.010000 -0.080000
MntSweetProducts 0.230000 0.110000 0.060000 -0.290000 -0.070000 -0.080000 0.050000 0.090000 0.060000 -0.280000 0.420000 -0.200000 0.570000
MntGoldProds 0.200000 -0.090000 0.120000 -0.160000 -0.000000 -0.110000 0.140000 -0.050000 0.730000 0.390000 -0.320000 -0.040000 0.040000
NumDealsPurchases -0.030000 -0.510000 0.170000 -0.110000 -0.070000 -0.030000 -0.220000 0.430000 -0.060000 -0.000000 -0.210000 0.000000 0.030000
NumWebPurchases 0.200000 -0.350000 0.090000 -0.020000 -0.090000 -0.060000 -0.110000 -0.240000 0.160000 0.060000 0.470000 0.330000 0.110000
NumCatalogPurchases 0.290000 0.010000 -0.000000 0.030000 0.020000 0.020000 -0.070000 -0.010000 0.020000 -0.150000 -0.360000 0.000000 0.110000
NumStorePurchases 0.260000 -0.150000 -0.090000 -0.110000 -0.110000 -0.010000 -0.160000 -0.230000 -0.320000 0.030000 -0.140000 0.000000 -0.130000
NumWebVisitsMonth -0.210000 -0.280000 0.370000 0.080000 0.050000 0.010000 0.100000 -0.000000 0.080000 -0.060000 0.320000 0.190000 -0.220000
age 0.060000 -0.210000 -0.510000 0.140000 0.180000 -0.010000 0.430000 0.210000 0.150000 -0.390000 -0.070000 0.440000 0.000000
Rel_Status -0.010000 -0.030000 -0.050000 -0.130000 -0.340000 0.920000 0.060000 -0.030000 0.140000 -0.050000 0.000000 -0.010000 -0.010000
Total_Amount_Spent 0.330000 0.010000 0.080000 0.110000 0.050000 0.060000 -0.020000 0.070000 -0.040000 0.010000 -0.010000 0.040000 -0.060000
Total_Purchases 0.280000 -0.320000 0.040000 -0.070000 -0.090000 -0.030000 -0.190000 -0.090000 -0.090000 -0.020000 -0.080000 0.120000 0.030000
Enrolled_In_Days 0.040000 -0.220000 0.470000 -0.080000 0.230000 0.090000 0.600000 -0.150000 -0.290000 -0.120000 -0.190000 -0.210000 0.100000
Total_Offers_Accepted 0.140000 0.060000 0.230000 0.610000 0.020000 0.030000 -0.120000 -0.020000 0.340000 -0.370000 0.030000 -0.320000 -0.060000
Amount_Per_Purchase 0.210000 0.130000 0.080000 0.210000 0.180000 0.160000 0.170000 0.460000 -0.110000 0.540000 0.250000 0.100000 0.050000

Observations:¶

  • PC1 has positive correlation with variables MntWines, MntMeatProducts, NumCatalogPurchases, Total_Amount_Spent & Total_Purchases
  • PC2 has negative correlation with variables NumDealsPurchases, NumWebPurchases, & Total_Purchases
  • PC3 has strong positive correlation with Enrolled_In-Days & NumWebVisits.
  • PC4 has strong positive correlation with Total_Offeres_Accepted.
  • PC6 has strong positive correlation with Recency.
  • PC6 thru PC13 have a mixture of both positive and negative correlations with various variables.

K-Means¶

Think About It:

  • How do we determine the optimal K value from the elbow curve?
  • Which metric can be used to determine the final K value?
  • In the elbow plot (SSE Vs. number of cluster) if we a see sudden change in the direction of the line plot and with consistent linear dip of SSE there after, that point would be the optimal number of clusters.
  • *Silhouette score is used to determins the better or final value of K.*
In [85]:
# Copy of the scaled data to store Labels from each algorithm
# Copy - original object plus reference address
# copy (deep = True) - Original object plus repetitive copies also stored
data_pca_copy = data_pca.copy(deep = True) 

Applying KMeans on the PCA data and visualize the clusters¶

In [86]:
# Step 1
# Dictionary to store SSE (Sum of Squared Error). SSE is nothing but Distortions
sse = {}

# Step 2 
# Iterate for a range of Ks and fit the pca components to the algorithm
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    sse[k] = kmeans.inertia_ # Store the inertia value for K

# Step 3
# Elbow plot
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-') # 'bx-' ==> blue x
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')

plt.show()

Observation:

  • In the above ::Elbow:: plot at K=2, we saw the SSE dip change direction, but at K=3, the SSE values started linear dip. So for this given dataset the optimum number of clusters could be either three or four.
  • May be will go with 3 !!!
In [87]:
# Silhouette Score
# Dictionary to store Silhouette Score for each value of K
sil_score = {}

# Iterate for a range of Ks and fit the PCs to the algorithm.
for k in range(2, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    labels = kmeans.predict(data_pca)
    sil_score[k] = silhouette_score(data_pca, labels)

# Elbow Plot
plt.figure()

plt.plot(list(sil_score.keys()), list(sil_score.values()), 'rx-')
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")

#plt.text(list(sil_score.keys()), sil_score, list(sil_score.values()),  ha='center')

plt.show()

Observation:¶

  • At K=4, the Silhouette Score (0.15, guesstimate from the above plot), but clearly see the line got flattened.
  • So I will go with k=4, clusters.
In [88]:
#K-Means Silhouette Score @k=4
kmeans_sc = silhouette_score(data_pca, labels)
print("K-Means Silhouette Score @k=4: ", round(kmeans_sc, 4))
K-Means Silhouette Score @k=4:  0.1076

Cluster Profiling¶

K-Means Cluster Profiling with k=4

In [89]:
kmeans = KMeans(n_clusters = 4, random_state = 1) # Applying the KMeans algorithm
kmeans.fit(data_pca) # fir the kmeans functiomn on the scaled data

# Adding the preicted labels to the copied data and the orifginal data
data_pca_copy['Labels_k4'] = kmeans.labels_ # Save the predictions on the pca components from K-Means
data_model['Labels_k4'] = kmeans.labels_
data_pca['Labels_k4'] = kmeans.labels_
In [90]:
# Number of observations in each cluster
data_model['Labels_k4'].value_counts()
Out[90]:
2    580
1    561
3    547
0    541
Name: Labels_k4, dtype: int64

Observations:¶

Wow!! How I stumbled into such equal distributions!!!

  • So I am thinking, in my earlier submission, I removed all the demographic information and and my groups were not clear. Even I went perplexity upto 700. In this round I added some of that information back with (10, 50, 5) perplexity configuration, I saw clear groupings.
In [91]:
# Calculating summary statistics of the original data for each label
mean = data_model.groupby('Labels_k4').mean()

median = data_model.groupby('Labels_k4').median()

df_kmeans = pd.concat([mean, median], axis = 0)

df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
                   'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']

df_kmeans.T
Out[91]:
group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_0 Median group_1 Median group_2 Median group_3 Median
Income 76434.511188 29304.335456 58988.666322 42364.727797 76982.00 29478.00 59422.000 42557.0
Kidhome 0.035120 0.855615 0.217241 0.669104 0.00 1.00 0.000 1.0
Teenhome 0.147874 0.016043 0.884483 0.967093 0.00 0.00 1.000 1.0
Recency 50.164510 48.171123 47.667241 50.652651 53.00 47.00 48.000 51.0
MntWines 631.449168 31.197861 485.991379 72.396709 575.00 12.00 423.000 44.0
MntFruits 68.469501 6.058824 27.401724 4.652651 54.00 3.00 16.000 2.0
MntMeatProducts 468.961183 24.953654 149.025862 27.137112 430.00 15.00 131.000 17.0
MntFishProducts 100.959335 8.976827 35.770690 6.605119 93.00 4.00 20.000 3.0
MntSweetProducts 71.393715 6.094474 27.437931 4.851920 58.00 3.00 15.000 2.0
MntGoldProds 77.704251 17.040998 66.412069 15.462523 57.00 11.00 47.000 9.0
NumDealsPurchases 1.181146 1.868093 3.720690 2.427788 1.00 1.00 3.000 2.0
NumWebPurchases 5.036969 2.190731 6.712069 2.372943 5.00 2.00 7.000 2.0
NumCatalogPurchases 6.048059 0.491979 3.289655 0.782450 6.00 0.00 3.000 1.0
NumStorePurchases 8.316081 3.074866 8.150000 3.676417 8.00 3.00 8.000 3.0
NumWebVisitsMonth 2.746765 6.985740 5.668966 5.804388 2.00 7.00 6.000 6.0
age 47.578558 37.787879 50.827586 52.524680 47.00 38.00 51.000 52.0
Rel_Status 1.611830 1.643494 1.660345 1.661792 2.00 2.00 2.000 2.0
Total_Amount_Spent 1418.937153 94.322638 792.039655 131.106033 1383.00 60.00 732.000 90.0
Total_Purchases 20.582255 7.625668 21.872414 9.259598 20.00 7.00 21.000 9.0
Enrolled_In_Days 904.826248 896.778966 960.756897 846.925046 908.00 901.00 977.500 828.0
Total_Offers_Accepted 1.055453 0.221034 0.400000 0.133455 1.00 0.00 0.000 0.0
Amount_Per_Purchase 74.297079 11.142139 36.335310 12.278720 67.59 8.62 33.225 11.0
In [92]:
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
                  'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
                  'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']

for col in cols_visualise:
    sns.boxplot(x = 'Labels_k4', y = col, data = data_model)
    plt.show()

Observations and Insights: from here we have to update.....¶

  • group_0 seems more affluent when compared to other groups, with other groups in the order of group_3, group_2 and group_2 with respect to mean Income.
  • group_0 and group_2 bought more products when compared to the other two groups.
  • It seems group_2 did more deal purchases when compared to others and most of those purchases could be on web.
  • group_0s Amount_Per_Purchase is very high, followed by group_2 in the middle range and the remaining two groups very low.
  • group_0 & group_2 seems to be taking advantage of the offers offered.
In [93]:
#Visualize Clusters using PCA

#cols_visualise = ['Income','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
#                  'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
#                  'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']

#for col in cols_visualise:
    #sns.scatterplot(x = col, y = 'Income', data = data_model, hue = 'Labels_k4', palette = 'Dark2')
sns.scatterplot(x = 0, y = 1, data = data_pca_copy,  hue = 'Labels_k4', palette = 'Dark2')
    
plt.show()

Observations:¶

  • Looking at the above plot, second guessing again that group_1 and group_2 could form one big cluster.
  • group_1 has couple of outliers very far from the group.
  • But to get more vivid insights, the more the clusters is better than the other way around.

Describe the characteristics of each cluster¶

Think About It:

  • Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

Summary of each cluster:¶

group_0¶

This group_3 seems to be most affluent group out of the four groups I was able to discover. Their mean income is above 75K and spending more on Wines, meat and fish products. This group doing more shopping at stores, follwed by catalog purchases and web purchases. Also this group has accepted more offers offered than any other groups. They have less kids when compared to the other groups. When compared to toher groups their number of web visits are less, but they are doing resonable number purchases on web, after store and catalog.

group_2¶

From Income perspective this is the next group with more income after group_3. This group's mean income is about 60K. This group has more teens compared the younger kids. This group also spending more wine and meat products. For this group third in line spending is gold products. This group spedning more on store and web purchases.

group_3¶

This group seems to have both young kids and teen kids. Within their means they are also spending on wines, meat and gold products. Spending less on fruit and sweet products. They are visiting web more, but their shopping spread almost equally between store, web and deal purchases. In comparison, this group accepted very less number of offers.

group_1¶

This is the youngest and less affluent group when compared to the other groups with almost no teens and with very few young kids. They are also spending on gold products along with wine and meat products, than on fish products. This group visits web more often than other groups, but their spenign is very prudent, in the order of store, web abd deal purchases.

**Let's see if other clustering algorithms yield anymore insights than K-Means

K-Medoids¶

In [95]:
#data_pca = data_pca1.copy()
#data_pca1 = data_pca.drop(["Labels_k4","kmedoLabels_k4"], axis=1)
In [ ]:
#data_model.info()
In [96]:
kmedo = KMedoids(n_clusters = 4, random_state = 1) #_ Apply the K-Medoids algorithm on the pca components with n_components=2 and random_state=1

kmedo.fit(data_pca) #_ Fit the model on the pca components

data_pca_copy['kmedoLabels_k4'] = kmedo.predict(data_pca)

data_model['kmedoLabels_k4'] = kmedo.predict(data_pca)

#data_pca['kmedoLabels_k4'] = kmedo.predict(data_pca)
In [98]:
#K-Medoids Silhouette Score
labels = kmedo.predict(data_pca)
kmedo_sc = silhouette_score(data_pca, labels)
print("K-Medoids Silhouette Score: ", round(kmedo_sc, 4))
K-Medoids Silhouette Score:  0.1071
In [99]:
# Number of observations in each cluster
data_model.kmedoLabels_k4.value_counts()
Out[99]:
2    772
3    584
1    485
0    388
Name: kmedoLabels_k4, dtype: int64
In [100]:
# Calculating summary statistics of the original data for each label
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
                  'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
                  'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
               
mean = data_model.groupby('kmedoLabels_k4').mean()

median = data_model.groupby('kmedoLabels_k4').median()

df_kmedoids = pd.concat([mean, median], axis = 0)

df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean', 
                     'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']

df_kmedoids[cols_visualise].T
Out[100]:
group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_0 Median group_1 Median group_2 Median group_3 Median
Income 29432.234315 34812.944547 55362.002727 75572.404436 30130.5 34738.0 55760.0 76063.0
MntWines 39.376289 24.688660 374.356218 623.864726 18.0 15.0 298.5 567.5
MntFruits 7.018041 3.070103 18.718912 68.863014 4.0 2.0 10.0 54.5
MntMeatProducts 28.636598 15.047423 112.845855 451.130137 17.0 10.0 87.5 415.0
MntFishProducts 10.422680 4.369072 25.380829 99.756849 6.0 2.0 13.0 90.0
MntSweetProducts 6.739691 3.259794 18.424870 72.244863 4.0 2.0 9.5 58.5
MntGoldProds 20.685567 9.525773 51.987047 78.416096 12.5 6.0 34.0 57.0
NumDealsPurchases 2.319588 1.647423 3.494819 1.328767 2.0 1.0 3.0 1.0
NumWebPurchases 2.417526 1.635052 5.549223 5.359589 2.0 1.0 5.0 5.0
NumCatalogPurchases 0.579897 0.331959 2.645078 5.917808 0.0 0.0 2.0 6.0
NumStorePurchases 3.121134 3.008247 7.010363 8.356164 3.0 3.0 7.0 8.0
NumWebVisitsMonth 7.546392 5.925773 5.634715 2.938356 8.0 6.0 6.0 2.0
Total_Amount_Spent 112.878866 59.960825 601.713731 1394.275685 73.0 46.0 530.0 1368.5
Total_Purchases 8.438144 6.622680 18.699482 20.962329 8.0 6.0 19.0 21.0
Total_Offers_Accepted 0.273196 0.107216 0.347150 0.982877 0.0 0.0 0.0 0.0

Observation:¶

Groups labels have shifted between groups!!! Like K-Means I was expecting group_3 and group_0 to be affluent. But looking at below PCA visualization, it confirms that the clusters did not change, but only the labels have shifted.

Visualize the clusters using PCA¶

In [101]:
#Visualize Clusters using PCA
#cols_visualise = ['Total_Amount_Spent', 'Total_Offers_Accepted', 'Total_Purchases', 'Enrolled_In_Days', 'Recency', 'Amount_Per_Purchase']

#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'kmedoLabels_k4', palette = 'Dark2')
    
plt.show()

Observations:¶

  • Cluster shapes remained more or less same as K-Means, but the hue color combination changed, when compared to the K-Means scatter plot.

Cluster Profiling¶

In [102]:
for col in cols_visualise:
    sns.boxplot(x = 'kmedoLabels_k4', y = col, data = data_model)
    
    plt.show()

Observations and Insights:

  • group_3 seems more affluent when compared to other groups, with other groups in the order of group_2, group_1 and group_0 with respect to mean Income.
  • group_3 and group_2 bought more products when compared to the other two groups.
  • It seems group_2 did more deal purchases when compared to others and most of those purchases could be on web.
  • group_3, followed by group_2 seems to be taking advantage of the offers offered.

Characteristics of each cluster¶

Summary for each cluster:

group_3¶

This group_3 seems to be most affluent group out of the four groups. Their mean income is about 75K and spending more on Wines and meat products. The next tier of shopping is on Fish, Gold, Sweet and fruit products. This group doing more shopping at stores, follwed by catalog purchases and web purchases. Also this group has accepted more offers offered than any other groups. When compared to other groups their number of web visits are less, but they are eaually spening on Web and catalog products, after store purchases.

group_2¶

From Income perspective this is the next group with more income after group_3. This group's mean income is around 55k. This group also spending more wine and meat products. For this group third in line spending is gold products. This group spedning more on store and web purchases.

group_1¶

This group's mean income is around 34K. Within their means they are also spending on wines, meat and gold products. Spending less on fruit and sweet products. They are 2nd highest among the groups in visiting the web, but their shopping spread almost equally between web and deal purchases. Store purchases are slightly higher than web abd deal purchases. In comparison, this group accepted very less number of offers.

group_0¶

This is less affluent group when compared to the other groups with mean income around 29K. They are also spending on gold products along with wine and meat products. This group visits web more often than any other groups, but their spending is very prudent, in the order of store, web abd deal purchases. Their usage of catalog channel is very low.

**With this K-Medoids algorithm also, same customer behavior observed as ealier in K-Means algorithm. Also observed that the label values are little bit smaller when compared to K-Means label values.

**All the customer trends, with samller numbers, are same between K_Means and K-Medoids.

Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [103]:
# List of all linkage methods to check
linkage_methods = ['single', 'average', 'complete']

# list of distance metrics
distance_metrics = ["euclidean", "cityblock", "correlation", "cosine"]

high_cophenet_corr = 0     # Creating a variable by assigning 0 to it
high_dm_lm = [0, 0]         

for dm in distance_metrics:
    for lm in linkage_methods:
        # Z - linkage matrix
        Z = linkage(data_pca, metric=dm, method=lm) # Applying different linkages with different distances on data_pca1
        #c - cophenet correlation
        [c, cophenet_dist] = cophenet(Z, pdist(data_pca)) # Z - output of linkage function. 
                                                           # cophenet_dist - height of the link
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:      # Checking if cophenetic correlation is higher than previous score
            high_cophenet_corr = c      # Appending to high_cophenet_corr list if it is higher
            high_dm_lm[0] = dm          # Appending its corresponding distance
            high_dm_lm[1] = lm          # Appending its corresponding method or linkage
Cophenetic correlation for Euclidean distance and single linkage is 0.7316482357048057.
Cophenetic correlation for Euclidean distance and average linkage is 0.8415725173893369.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7709704860518053.
Cophenetic correlation for Cityblock distance and single linkage is 0.7547761040359486.
Cophenetic correlation for Cityblock distance and average linkage is 0.8387365549064711.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8084480191971283.
Cophenetic correlation for Correlation distance and single linkage is 0.5003386528997378.
Cophenetic correlation for Correlation distance and average linkage is 0.594236294654903.
Cophenetic correlation for Correlation distance and complete linkage is 0.5325071285023306.
Cophenetic correlation for Cosine distance and single linkage is 0.5043462132260496.
Cophenetic correlation for Cosine distance and average linkage is 0.5885397187935416.
Cophenetic correlation for Cosine distance and complete linkage is 0.554279256538745.
In [104]:
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8415725173893369, which is obtained with Euclidean distance and average linkage.

Dendograms for different Linkages with euclidean distance

In [106]:
# Linkage methods list
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
comp_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size (15, 30)

# Loop thru all the above list of linkage methods
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method) # distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # dendrogram title

    [cophenet_corr, cophenet_dist] = cophenet(Z, pdist(data_pca)) # Calculating cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{cophenet_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observation:¶

  • Dendogram (Complete Linkage) with euclidean distance plot is little clear when compared average linkage and single linkage.

Now with Cityblock

In [107]:
# Linkage methods list
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
comp_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(20,30))            # Setting the plot size (15, 30)

# Loop thru all the above list of linkage methods
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Cityblock", method=method) # distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # dendrogram title

    [cophenet_corr, cophenet_dist] = cophenet(Z, pdist(data_pca)) # Calculating cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{cophenet_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observation:¶

  • With citybock distance, no clear hierarchy plots.

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?
  • Are there any distinct clusters in any of the dendrograms?

Observation¶

  • With Complete linkage with Euclidean distance we are seeing four clusters at threshold around 15.

Visualize the clusters using PCA¶

In [108]:
# Clustering with 4 clusters
hierarchical = AgglomerativeClustering(n_clusters = 4, affinity = 'CityBlock', linkage = 'average')

hierarchical.fit(data_pca)
Out[108]:
AgglomerativeClustering(affinity='CityBlock', linkage='average', n_clusters=4)
In [109]:
data_pca_copy['HCLabels_k4'] = hierarchical.labels_

data_model['HCLabels_k4'] = hierarchical.labels_
In [111]:
# Hierarchical Clustering Silhouette Score
labels = hierarchical.fit_predict(data_pca)
kmedo_sc = silhouette_score(data_pca, labels)
print("Hierarchical Clustering Silhouette Score: ", round(kmedo_sc, 4))
Hierarchical Clustering Silhouette Score:  0.4721
In [112]:
data_model.HCLabels_k4.value_counts()
Out[112]:
0    2223
1       4
2       1
3       1
Name: HCLabels_k4, dtype: int64

Observation:¶

  • Not sure we can deduce any kind of insights with this kind of grouping. But teh Silhouette Score is highest so far.
In [ ]:
# Checking for HCLabels == 1
#data[data.HCLabels == 1]
In [113]:
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
                  'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
                  'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']

mean = data_model.groupby('HCLabels_k4').mean()

median = data_model.groupby('HCLabels_k4').median()

df_hierachical = pd.concat([mean, median], axis = 0)

df_hierachical.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
                        'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']

df_hierachical[cols_visualise].T
Out[113]:
group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_0 Median group_1 Median group_2 Median group_3 Median
Income 51707.923994 44234.909534 2447.0 51633.638134 51569.0 29388.819067 2447.0 51633.638134
MntWines 306.095816 27.000000 1.0 32.000000 179.0 11.000000 1.0 32.000000
MntFruits 26.479982 2.750000 1.0 2.000000 8.0 3.000000 1.0 2.000000
MntMeatProducts 164.463788 12.750000 1725.0 1607.000000 68.0 7.500000 1725.0 1607.000000
MntFishProducts 37.782726 2.750000 1.0 12.000000 12.0 2.500000 1.0 12.000000
MntSweetProducts 27.024291 132.750000 1.0 4.000000 8.0 133.000000 1.0 4.000000
MntGoldProds 43.893387 244.250000 1.0 22.000000 25.0 306.000000 1.0 22.000000
NumDealsPurchases 2.320288 0.000000 15.0 0.000000 2.0 0.000000 15.0 0.000000
NumWebPurchases 4.067926 25.500000 0.0 0.000000 4.0 26.000000 0.0 0.000000
NumCatalogPurchases 2.633828 0.250000 28.0 0.000000 2.0 0.000000 28.0 0.000000
NumStorePurchases 5.829960 0.250000 0.0 1.000000 5.0 0.000000 0.0 1.000000
NumWebVisitsMonth 5.336932 0.750000 1.0 0.000000 6.0 1.000000 1.0 0.000000
Total_Amount_Spent 605.739991 422.250000 1730.0 1679.000000 397.0 387.500000 1730.0 1679.000000
Total_Purchases 14.852002 26.000000 43.0 1.000000 15.0 26.000000 43.0 1.000000
Total_Offers_Accepted 0.449393 0.000000 0.0 1.000000 0.0 0.000000 0.0 1.000000
In [114]:
#Visualize Clusters using PCA
#cols_visualise = ['Total_Amount_Spent', 'Total_Offers_Accepted', 'Total_Purchases', 'Enrolled_In_Days', 'Recency', 'Amount_Per_Purchase']

#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'HCLabels_k4', palette = 'Dark2')
    
plt.show()

Observation:¶

Hmmm!!! Scatter plot says its one big cluster.

Cluster Profiling¶

In [115]:
for col in cols_visualise:
    sns.boxplot(x = 'HCLabels_k4', y = col, data = data_model)
    plt.show()

Observations and Insights: Can't deduce any insights, since no clusters.

Characteristics of each cluster¶

Summary of each cluster:

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It includes two hyperparameters, eps, and min samples.

Since it is an unsupervised algorithm, you have no control over it, unlike a supervised learning algorithm, which allows you to test your algorithm on a validation set. The approach we can follow is basically trying out a bunch of different combinations of values and finding the silhouette score for each of them.

In [116]:
# Range for eps and min-samples
eps_range = np.arange(1, 2)
min_samples_range = range(2, 12)

# Load your data here or create a random dataset for testing
#X = np.random.randn(100, 2)

# Create an empty array to store the silhouette scores
silhouette_scores = np.zeros((len(eps_range), len(min_samples_range)))

# Loop over all combinations of eps and min-samples
for i, eps in enumerate(eps_range):
    for j, min_samples in enumerate(min_samples_range):
        
        # Create a DBSCAN object with the current hyperparameters
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        
        # Fit the DBSCAN model to the data
        dbscan.fit(data_pca)
        
        
        # Calculate the silhouette score for the current clustering
        if len(set(dbscan.labels_)) > 1:  # check if there's more than one cluster
            silhouette_scores[i, j] = silhouette_score(data_pca, dbscan.labels_)
        else:
            silhouette_scores[i, j] = -1  # set the score to -1 if there's only one cluster
In [117]:
# Print the silhouette scores for each combination of hyperparameters
print("Silhouette scores:")
for i, eps in enumerate(eps_range):
    for j, min_samples in enumerate(min_samples_range):
        print(f"eps={eps}, min_samples={min_samples}: {silhouette_scores[i, j]:.3f}")
Silhouette scores:
eps=1, min_samples=2: -0.254
eps=1, min_samples=3: -0.355
eps=1, min_samples=4: -0.227
eps=1, min_samples=5: -0.234
eps=1, min_samples=6: -0.217
eps=1, min_samples=7: -0.206
eps=1, min_samples=8: -0.210
eps=1, min_samples=9: -0.114
eps=1, min_samples=10: -0.114
eps=1, min_samples=11: -0.145
In [118]:
# Find the highest silhouette score and its hyperparameters
best_score = np.max(silhouette_scores)
best_eps_idx, best_min_samples_idx = np.unravel_index(np.argmax(silhouette_scores), silhouette_scores.shape)
best_eps, best_min_samples = eps_range[best_eps_idx], min_samples_range[best_min_samples_idx]
In [119]:
# Print the best hyperparameters and the highest silhouette score
print(f"The best hyperparameters are eps={best_eps} and min_samples={best_min_samples}, with a silhouette score of {best_score:.3f}.")
The best hyperparameters are eps=1 and min_samples=10, with a silhouette score of -0.114.

Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶

In [120]:
# Create a DBSCAN object with the best hyperparameters from above
eps = 1
min_samples = 8
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        
# Fit the DBSCAN model to the data
dbscan.fit(data_pca)

data_pca_copy['DBSLabels_k4'] = dbscan.fit_predict(data_pca)

data_model['DBSLabels_k4'] = dbscan.fit_predict(data_pca)
In [121]:
# DBSCAN Clustering Silhouette Score
labels = dbscan.fit_predict(data_pca)
dbscan_sc = silhouette_score(data_pca, labels)
print("DBSCAN Clustering Silhouette Score: ", round(dbscan_sc, 4))
DBSCAN Clustering Silhouette Score:  -0.2097
In [122]:
data_model['DBSLabels_k4'].value_counts()
Out[122]:
-1    2142
 0      58
 1      13
 2       8
 3       8
Name: DBSLabels_k4, dtype: int64
In [123]:
# Calculating the mean & median
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
                  'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
                  'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']

mean = data_model.groupby('DBSLabels_k4').mean()

median = data_model.groupby('DBSLabels_k4').median()

df_hierachical = pd.concat([mean, median], axis = 0)

df_hierachical.index = ['group_-1 Mean', 'group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
                        'group_-1 Median', 'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']

df_hierachical[cols_visualise].T
Out[123]:
group_-1 Mean group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_-1 Median group_0 Median group_1 Median group_2 Median group_3 Median
Income 52520.696861 29596.304106 29956.538462 27722.375 43825.375 52372.5 28691.0 30992.0 27663.5 45494.0
MntWines 317.171802 10.655172 10.692308 12.875 43.750 198.0 7.0 9.0 11.0 41.5
MntFruits 27.371615 3.258621 2.923077 2.250 0.500 9.0 2.5 2.0 1.0 0.0
MntMeatProducts 171.810458 11.706897 11.384615 7.750 9.875 73.0 11.0 13.0 7.5 10.5
MntFishProducts 39.057890 4.431034 4.769231 3.625 0.625 13.0 3.0 3.0 3.0 0.0
MntSweetProducts 28.187208 2.706897 4.153846 2.500 0.375 9.0 1.5 3.0 2.5 0.0
MntGoldProds 45.756303 7.068966 6.538462 5.125 3.625 26.0 6.0 5.0 5.0 4.0
NumDealsPurchases 2.360878 1.310345 1.153846 1.125 2.000 2.0 1.0 1.0 1.0 2.0
NumWebPurchases 4.211485 1.413793 1.384615 1.125 1.875 4.0 1.0 1.0 1.0 2.0
NumCatalogPurchases 2.744631 0.068966 0.000000 0.000 0.125 2.0 0.0 0.0 0.0 0.0
NumStorePurchases 5.939309 2.672414 2.769231 2.875 3.250 5.0 3.0 3.0 3.0 3.0
NumWebVisitsMonth 5.252101 7.293103 7.076923 7.000 5.875 6.0 7.0 7.0 7.0 6.0
Total_Amount_Spent 629.355275 39.827586 40.461538 34.125 58.750 425.0 37.0 43.0 34.0 54.5
Total_Purchases 15.256303 5.465517 5.307692 5.125 7.250 16.0 5.0 5.0 5.0 7.0
Total_Offers_Accepted 0.466853 0.000000 0.000000 0.000 0.000 0.0 0.0 0.0 0.0 0.0
In [124]:
#Visualize Clusters using PCA
#cols_visualise = ['Total_Amount_Spent', 'Total_Offers_Accepted', 'Total_Purchases', 'Enrolled_In_Days', 'Recency', 'Amount_Per_Purchase']

#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'DBSLabels_k4', palette = 'Dark2')
    
plt.show()

Observation:¶

Again no clusters to analyse.

In [125]:
for col in cols_visualise:
    sns.boxplot(x = 'DBSLabels_k4', y = col, data = data_model)
    
    plt.show()

Observations and Insights:

Think about it:

  • Changing the eps and min sample values will result in different DBSCAN results? Can we try more value for eps and min_sample?

Comment¶

Yes, changing the eps & min_sample will result in different results. Since its a trail and error, it might take so many iterations to reach clear demarkation of clusters. Also I am running out of time to finish up and submit.

Characteristics of each cluster¶

Summary of each cluster:

  • The general customer behavior trend did not change, like what we have seen in K-Mean thru this algorithm. Mean and Median numbers further refined. Group_-1 seems affluent group and spending more, where rest of the groups are prudent with less spending.
  • *group_-1!!! those may outliers??***

Gaussian Mixture Model¶

In [126]:
# GMM initialized with clusters = 4 and random_state = 1
gmm = GaussianMixture(n_components = 4, random_state = 1)

gmm.fit(data_pca) # Fitting the model

labels = gmm.predict(data_pca)

data_pca_copy['GmmLabels_k4'] = gmm.predict(data_pca)

data_model['GmmLabels_k4'] = gmm.predict(data_pca)

silhouette_scores = silhouette_score(data_pca, labels)
print("GMM Silhouette Score: ", round(silhouette_scores, 3))
GMM Silhouette Score:  0.181
In [127]:
data_model.GmmLabels_k4.value_counts()
Out[127]:
0    580
3    561
1    547
2    541
Name: GmmLabels_k4, dtype: int64

Observation:¶

  • Cool, we are back to some meaningful distribution!!!
In [128]:
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
                  'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
                  'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']

mean = data_model.groupby('GmmLabels_k4').mean()

median = data_model.groupby('GmmLabels_k4').median()

df_gmm = pd.concat([mean, median], axis = 0)

df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean','group_3 Mean',
                'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']

df_gmm[cols_visualise].T
Out[128]:
group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_0 Median group_1 Median group_2 Median group_3 Median
Income 58988.666322 42364.727797 76434.511188 29304.335456 59422.0 42557.0 76982.0 29478.0
MntWines 485.991379 72.396709 631.449168 31.197861 423.0 44.0 575.0 12.0
MntFruits 27.401724 4.652651 68.469501 6.058824 16.0 2.0 54.0 3.0
MntMeatProducts 149.025862 27.137112 468.961183 24.953654 131.0 17.0 430.0 15.0
MntFishProducts 35.770690 6.605119 100.959335 8.976827 20.0 3.0 93.0 4.0
MntSweetProducts 27.437931 4.851920 71.393715 6.094474 15.0 2.0 58.0 3.0
MntGoldProds 66.412069 15.462523 77.704251 17.040998 47.0 9.0 57.0 11.0
NumDealsPurchases 3.720690 2.427788 1.181146 1.868093 3.0 2.0 1.0 1.0
NumWebPurchases 6.712069 2.372943 5.036969 2.190731 7.0 2.0 5.0 2.0
NumCatalogPurchases 3.289655 0.782450 6.048059 0.491979 3.0 1.0 6.0 0.0
NumStorePurchases 8.150000 3.676417 8.316081 3.074866 8.0 3.0 8.0 3.0
NumWebVisitsMonth 5.668966 5.804388 2.746765 6.985740 6.0 6.0 2.0 7.0
Total_Amount_Spent 792.039655 131.106033 1418.937153 94.322638 732.0 90.0 1383.0 60.0
Total_Purchases 21.872414 9.259598 20.582255 7.625668 21.0 9.0 20.0 7.0
Total_Offers_Accepted 0.400000 0.133455 1.055453 0.221034 0.0 0.0 1.0 0.0
In [129]:
for col in cols_visualise:
    sns.boxplot(x = 'GmmLabels_k4', y = col, data = data_model)
    
    plt.show()

Observations and Insights:

  • group_2 seems more affluent when compared to other groups, with other groups in the order of group_0, group_1 and group_3 with respect to mean Income.
  • group_2 and group_0 bought more products when compared to the other two groups.
  • It seems group_1 did more deal purchases when compared to others and most of those purchases could be on web.
  • group_2, followed by group_1 seems to be taking advantage of the offers offered.

**Similar behavior found with K-Medoids algorithm also.

Visualize the clusters using PCA¶

In [130]:
#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'GmmLabels_k4', palette = 'Dark2')
    
plt.show()

Observation:¶

  • This scatter plot has more outliers!!, I would say more data overlap for one group, when compared to the K-Means and K-Medoids scatter plots.

Cluster Profiling¶

Observations and Insights:

**Results from this algorithm are very much looking like K-Means & K-Medoids algorithms.

Characteristics of each cluster¶

Summary of each cluster:

group_2¶

This group_2 seems to be most affluent group out of the four groups. Their mean income is about 76K and spending more on Wines and meat products. The next tier of shopping is on Fish, Gold, Sweet and fruit products. This group doing more shopping at stores, followed by catalog purchases and web purchases. Also this group has accepted more offers offered than any other groups. When compared to other groups their number of web visits are less, but they are eaually spending on Web and catalog products, after store purchases.

group_0¶

From Income perspective this is the next group with more income after group_2. This group's mean income is around 59k. This group also spending more wine and meat products. For this group third in line spending is gold products. This group spending more on store and web purchases and then catalog and deals.

group_1¶

This group's mean income is around 42K. Within their means they are also spending on wines, meat and gold products. Spending less on fruit and sweet products. They are 2nd highest among the groups in visiting the web, but their shopping spread almost equally between store and web purchases, followed by deal purchases. In comparison, this group accepted very less number of offers.

group_3¶

This is less affluent group when compared to the other groups with mean income around 29K. Their top spending is on wines, folloed by meat and then gold products. This group visits web more often than any other groups, but their spending is very prudent, in the order of store, deals and then web purchases. Their usage of catalog channel is negligible.

Executive Summary¶

Key Takeaways¶

  • Identify and focus on the big picture first and all of its components

    • Conducted this investigation and analysis on the given customer dataset to identify hidden patterns of customer behaviors, spending habits and individual customer preferences to optimize the marketing budget, make informed decisions and in-turn increase the organization's ROI.
  • These components are usually the driving force for the end goal

    • We tried to group the given customers by employing various unsupervised learning techniques, to identify most effective marketing channels and identify new market opportunities.
  • Summarize the most important findings and takeaways in the beginning

    • Customers in all the groups spent more on wines and meat products. So stores, as well as web should make these products available in near proximity, as well as have some tips to customers like which wine goes best with what kind of meat product.

Next Steps¶

  • Steps that can taken to improve the solution

    • There is an ample opportunity to enhance the data quality and reduce the number of outliers in various data variables.
    • This is iterative process, so we have to keep improving the solution by adding new features or delete features which are not adding value to the solution.
  • How to make the best of the solution?

    • Data Scientists or Analysts interpret the solution results in understanding way to organizations's stakeholders.
    • Provide visualizations to help them understand the data.
    • Periodically compare the results of various clustering algorithms to identify cost effective solution or to improve the current solution.
    • By following the documented recommendations, the organization can reap the benefits of this solution.
  • What are the steps to be followed by the stakeholders

    • Stakeholders are involved in every step of the solution. They need to define the problem, identify the data, define the solution performance measurement metric, and interpret the clustering results and understand how those findings will impact the organization.

Problem and Solution Summary¶

What problem was being solved?¶

  • Summary of the problem
    • Trying to investigate and analyse the hidden patterns in the given dataset, while using the combination of mathematics and computer science to solve business issues and provide actionable options for business leaders. Using the machine learning models and algorithms, automate the complex business processes and Identify new business opportunities for organization. Help the organization devise targeted marketing campaigns tailored towards individual customers.

Final proposed solution design¶

  • What are the key points that describe the final proposed solution design?
    1. Importing various libraries to carry out this analssis. Ex: numpy, pandas, matplotlib.pyplot, seaborn, sklearn, scipy etc.,
    2. Load the data and understand.
    3. Get the feel of the data by checking the data types and missing values.
    4. Conduct Exploratory Data Analysis (EDA)
      • Explore each variable in the dataset for range of values and central tendency.
        • Univariate Anaysis - Numerical Data & Categorical Data
        • Bivariate Analysis
    5. Feature Engineering and Data preprocessing
      • Creating new columns
      • Imputing missing values
    6. Preparing data for segmentation
      • Dropping some columns which we will not sue for segmentation.
    7. Scaling the Data
      • Scaling needed to normalize the data for unsuppervised learning algorithms, under the hood who uses distance to find the similar datapoints/observations and also to minimize the impact of one feature on the other.
    8. Data Visualizations in 2D or 3D, by employing dimensionality reduction techniques.
      • t-distributed stochastic neighbor embedding (t-SNE)
      • Principal Component Analysis (PCA)
    9. Apply various clustering algorithms on the data after applying PCA.
      • 1.K-Means
      • 2.K-Medoids
      • 3.Hierarchical Agglomerative Clustering
      • 4.DBSCAN
      • 5.Gaussian Mixture Model (GMM)
    10. Performance evaluation of various models using Silhouette Score metric

Why is this a valid solution that is likely to solve the problem¶

  • The reason for the proposed solution design

    • For the given dataset, before applying the above solution desing we don't know the structure or relatioships within the data, nor we don't know what data represents and we have no prior pattern knowledge what we are looking for. So with the above solution design tryng to identify the customer segmentation using distance as metric between data points.
  • How it would affect the problem/business?

    • The above solution design allows us the find the hidden patterns and various relationships in the data, which will help the organization's stakeholders to make informed decisions on.....
      • in optimizing the marketing budget
      • in creating targeted marketing campaigns
      • in offering tailored products to customers
      • in analysing customer feedback and enhancing the product functionalty etc.,

Conclusion and Recommendations¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

Silhouette Scores:

  • K-Means: 0.1351
  • K-Medoids: 0.1071
  • Hierarchical: 0.4721
  • DBSCAN: -0.2097
  • Gaussian Mixture Model: 0.181
  • I was expecting K-Medoids or GMM to perform better, but surprise Hierarchical Clustering algorithm has highest Silhouette Score when compared to K-Means, K-Medoids and GMM. But remember Hierarchical has 2223 observations in one group and 4, 1, 1 observatiosn in rest of three groups. Similarly DBSCAN has 2142 in one group and 58, 13, 8, 8 in other respective groups. One way to look at them is so many outliers. From that perspective I will not select Hierarchical & DBSCAN for this given dataset.

  • Feature enhancement as well as inclusion may improve the information in the dataset, may reduce the outliers/noise. After few trials and errors, and right parameter tuning might help us to use and interpret Hierarchical Agglomerative Clustering algorithm also on this given dataset.

2. Refined insights:

  • What are the most meaningful insights from the data relevant to the problem?
    • Even though we identified only four clusters, looking at the Dendrograms from Hierarchical algorithm, got an hunch that there are so many sub populations in the given dataset. We can further use the GMM to identify these sub groups and may be able to create a personalized experience to customers from the insights gained after the sub population analysis. Ex: Like providing more meat and wine products to the affulent customers group whose mean income is around 76K. Also this group is doing more store purchases than web shopping. So showcase the products in the store in such way that they attract this group of customers.
    • Make Data Driven decisions. Ex: The most prudent group_3 whose mean income is around 29K, has more web visits. So we have to showcase our offerings more on web to this group.

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?
    • For this given dataset I propose Gaussian Mixture Model (GMM) & K-Medoids. Looking at the GMM scatter plot (visualizing using PCA data), I see some data overlap in the given dataset and we all know GMM handles the data overlap better than other algorithms. Even though the K-Means Silhouette Score is higher than K-Medoids, I am going with K-Medoids, since the centroids of K-Means clustering don't have business meaning and very sensitive to outliers..

What are some key recommendations to implement the solution?

  • Clearly define the problem and objective of the clustering solution.
  • Preprocess and clean the data, handle missing values and scale the data to reduce the variable bias.
  • Choose the algorithm that is best suited for your problem dataset. Ex: K-Means, K-Medoids, Hierarchical, DBSCAN, Gaussian Mixture
  • Employ the available methods like elbow method or the Silhouette Score etc., to determine the number of clusters
  • Run the clustering algorithm and evaluate the results by checking for any outliers, and analysing the cluster characteristics. Also interpret the results and look for any patterns in the give dataset.
  • In summary to implement a clustering solution we need planning, data preparation, algorithm selection and skillful interpretation of the clustering results to provide valuable business insights.

What are the key actionables for stakeholders?

  • Based on clustering results and insights, stakeholders might create targeted campaigns offerings for identified segments.
  • Based on insights, may be able to design or enhance the customer experience.
  • Based on the patterns identified in the clustering solution, may improve the product offerings.
  • If clustering solution identified any ineffiencies in operations then may streamline the processes and procedures to improve efficiencies.
  • Take actions to mitigate the risk to acceptable levels

What is the expected benefit and/or costs?

  • List the benefits of the solution

    • We will have the better understanding of the customer behavior, their spending patterns and knowledge about their preferences.
    • Once we have the better understanding of the customer, and the underlysing data then we will be able to design efficient processes, make informed decisions to optimize our marketing budget and increase the ROI.
  • Take some rational assumptions to put forward some numbers on costs/benefits for stakeholders

    • The given customer dataset has information about past data with demographics, purchase history, and website activity (num of visits). We the data Science team are performing the clustering analysis and going to interpret the results. We expect this clustering solution will increase customer engagement and sales.
  • *Estimated Costs: Data cleaning and preprocessing: 4,000 Clustering algorithm development and testing: 12,000 Data visualization and interpretation: 6,000 Targeted marketing campaign development: 20,000 Implementation and monitoring: 25,000 Total cost: 67,000*

  • *Estimated Benefits: Increased customer engagement: 10% increase in website visits and 5% increase in average purchase amount. Increased sales: 500,000 dollars in additional revenue over the next 12 months. Assuming that the company has a profit margin of 10%*

  • *Net Estimated Benefit: Net benefit = (Revenue - Costs) Profit margin Net benefit = (500,000 - 67,000) 10% Net benefit = 43,300 Based on these assumptions, the clustering solution would generate a net benefit of 43,300 dollars over the next 12 months. This suggests that the clustering solution is a worthwhile investment for the company, as the benefits outweigh the costs. However, it is important to note that these are hypothetical estimates*.

What are the key risks and challenges?

  • What are the potential risks or challenges of the proposed solution design
    • Tried our best to avoid the over fitting of the data, but still a slight chance of over fitting or underfitting the data during preprocessing.
    • Silhouette Score performance metric suggesting the Hierarchical agglomerative algorithm, since I couldn't recognize clusters with that algorithm, I went with GMM, paired with K-Medoids. If GMM is not the right choice then potential for misleading or inaccurate customer segmentation.
    • If the chosen number of clusters is not optimal then my insights could inaccurate or misleading.
    • Due to the time lapsed between the data collection, the insights gained may be obsolete.
    • The hidden patterns identified may not be in play due to customer behavior change influenced by external factors like pandemic(s) or natural disasters etc

What further analysis needs to be done or what other associated problems need to be solved?

  • Need to determine the optimal number of clusters for the given dataset.
  • May need to run with higher cluster numbers to see if we gain any further insights than what we were able to deduce with k=4 clusters.
  • Need to figure out how much information loss we are having by deleting all the demographic information from the model. Which should help us in determining how much demographic information we need to retain in the model to get meaningful and actionable insights.